Perl Foundation News: September 2014 Grant Votes

The Grants Committee has concluded the voting of the September round.

Proposals in this round

Voting Results

TitleYesNoScore
Nile07
IO::All Redux439 = 5+2+1+1
Inline::C(PP) ...8029 = 5+4+4+4+3+3+3+3
Pegex Grammar ...16
Swim Pod09

Definition of the score is found in 3.2 of the rules.

Details

Nile - Visual Web App Framework Separating Code From Design Multi Lingual And Multi Theme

First of all, thank you for submitting this application. It's always good to see new faces.

For this grant, we are unable to find a reason to spend a large portion of the community money on this new web framework at this point. We would like to see the author continue the development of Nile and get some more community support on it first.

IO::All Redux

We approve this grant. We will not fund this grant at this point and will re-evaluate this in the next round.

Inline::C(PP) Module Support

We are pleased that the Grants Committee will approve and fund this grant. This will be useful for the community and we already saw enough positive feedback. It should be noted that the total score was the second highest this year after the Perl::Lint grant.

Mark Jensen agreed to be the grant manager.

Pegex Grammar for YAML

We understand YAML is used in a number of places and the grant has certain value. However at this point we are not able to approve it.

Swim Pod

We hope we can discuss this grant after Swim gets wider adoption and community support.

Next round

The next round is in November. You can submit proposals now.

Gabor Szabo: Perl Maven - September 2014

For the full article visit Perl Maven - September 2014

The Effective Perler: Perl v5.20 fixes taint problems with locale

Perl v5.20 fixes taint checking in regular expressions that might use the locale in its pattern, even if that part of the pattern isn’t a successful part of the match. The perlsec documentation has noted that taint-checking did that, but until v5.20, it didn’t.

The only approved way to untaint a variable is through a successful pattern match with captures:

my $tainted = ...;

$tainted_var =~ m/\A (\w+ ) \z/x;
my $untainted = $1;

The problem is \w. Which characters does that match? I discussed this is Know your character classes under different semantics although I didn’t focus on \w.

With use locale, settings from outside the program decide what \w matches. If you don’t set the locale yourself, someone is setting it for you, possibly letting character classes match what you didn’t intend (or even know about). That’s contrary to the spirit of the advice in perlsec:

you must be exceedingly careful with your patterns.

If something doesn’t have an knowable meaning and you use it, you aren’t being “exceedingly careful”. perlsec recommends using no locale to fix this. It’s better to have a better pattern that doesn’t go anywhere near locale issues.

I cover this is greater detail in the “Secure Programming Techniques” in Mastering Perl.

The perlsec example is exceeding non-careful now with recent versions of Perl. Here is the example from the v5.20 docs, which has been the example since at least v5.003 (released in 1996):

 if ($data =~ /^([-\@\w.]+)$/) {
	$data = $1; 			# $data now untainted
    } else {
	die "Bad data in '$data'"; 	# log this somewhere
    }

You already know the problem with the \w. What are ^ and $? If you took my Learning Perl class, you know they are the beginning- and end-of-line anchors. Without any flags, the target string is one line. With the /m flag, they can match after or before a new line, respectively. They operate differently and you might not get the behavior that you want.

Prior to v5.14 (more correctly, the version of re that came with it), you had to put those flags with the qr// operator to compile the pattern or the operator that uses the pattern (see Know the difference between regex and match operator flags).

For v5.14 and later, the re module allows you to set default flags that apply to all patterns in its lexical scope (see Set default regular expression modifiers). The flags that might set something you didn’t intend to use and don’t show up near the code you care about. Those flags might not be there when you write the code, but someone adds them later if they have a fit of “modern Perl” or Perl Best Practices fever.

My guiding principle, though, it that any string which doesn’t exactly match what you expect isn’t safe enough to be untainted. If you mean the beginning or end of the absolute string, use the anchors that can only mean the beginning and end of string anchors, \A and \z (Item 35: Use zero-width assertions to match positions in a string).

Things to remember

  • The locale can change the meaning of character classes
  • Default regex flags change behavior from a distance
  • Don’t use character classes in regular expressions you use to untaint values
  • Use \A and \z for the absolute beginning
    and end of string

The Effective Perler: Use postfix dereferencing

Perl v5.20 offers an experimental form of dereferencing. Instead of the complicated way I’ll explain in the moment, the new postfix turns a reference into it’s contents. Since this is a new feature, you need to pull it in with the feature pragma (although this feature in undocumented in the pragma docs) (Item 2. Enable new Perl features when you need them. and turn off the experimental warnings:

use v5.20;
use feature qw(postderef);
no warnings qw(experimental::postderef);

To turn an array reference into its elements, after the arrow -> you use @ with a star, * after it:

my @elements = $array_ref->@*;

In prior versions, you have to use the circumfix notation to do the same thing:

my @elements = @$array_ref;
my @elements = @{ $array_ref };

The new notation is handy for things such as foreach which needs a list:

foreach my $element ( $array_ref->@* ) {
	...
	}

This feature is more interesting when the reference doesn’t come from a variable and you can’t conveniently use the circumfix notation. Perhaps the reference is the return value of a subroutine:

sub some_sub {
	state $ref = [ qw(cat dog bird) ];
	$ref;
	}

foreach my $element ( some_sub()->@* ) {
	...
	}

This works for all of the reference types, although the subroutine version is a bit weird. I’ll get to the subroutine postfix dereference later.

If you use a subscripty thing after the postfix sigil to get a single element of a slice:

my $first = some_sub()->@[0];
my @slice = some_sub()->@[0,-1];

The single element access is not that interesting because you can already do that without the postfix sigil:

my $first = some_sub()->[0];

Here’s an interesting bit of Perl syntax. What happens when you put two indices in that?

my( $first ) = some_sub()->[0, -1];

Perl doesn’t think that’s a slice so it treats the expression inside the braces in scalar context. The comma in scalar context evaluates the left had side, discards the result, evaluates the right hand side, and returns the result. It’s the same thing as the behavior in the FAQ What is the difference between a list and an array?.

In this code, only $first gets a value, but it’s from the last subscript, while $second gets none:

my( $first, $second ) = some_sub()->[ 
	do { say "Evaluated!"; 0 }, 1 
	]; 
say "First is $first";
say "Second is $second";

Perl evaluates the first “subscript”, but doesn’t use it. It uses the last result in the comma chain, even though the assignment looks like it’s in list context.

Wedge that postfix @ in there and both $first and $second get values:

my( $first, $second ) = some_sub()->@[ 
	do { say "Evaluated!"; 0 }, 1 
	]; 
say "First is $first";
say "Second is $second";

Don’t make the common mistake of associating the @ with an array. Just like the array and hash slices, Perl figures out the type with the the subscripty braces.

A postfix hash slice still has the @ and gives back a list of values:

sub get_hashref {
	state $ref = { 
		cat  => 'Buster',
		dog  => 'Addy',
		bird => 'Poppy',
		};
	$ref;
	}

my( $first, $second ) = get_hashref()->@{ qw(cat dog) }; 
say "First is $first";
say "Second is $second";

Curiously, in all of this, the subroutine with the postfix dereference is always called in scalar context despite what you are doing with the result or how you are assigning it. A reference is always a scalar, and that’s a single thing.

Scalar and array interpolation

You can interpolate the postfix dereference notation with scalar and array references if you enable the postderef_qq feature (also undocumented in the pragma docs):

use v5.20;

use feature qw(postderef postderef_qq);
no warnings qw(experimental::postderef);


my $scalarref = \ 'This is the string';
say "Scalar is < $scalarref->$* >";
my $arrayref = [ qw(cat bird dog) ];

say "Array is < $arrayref->@* >";

Notice that without the postderef feature, Perl would try to interpolate the $* (a Perl 4 multi-line feature removed in v5.10).

This only works for scalar and array references.

Circumfix notation

To appreciate the ease of the postfix dereference notation, you should understand the circumfix notation which earlier versions use. I cover this in Intermediate Perl, but here’s a short explanation.

Remember the notation for an array variable. There’s a sigil to give some context, an identifier (the name), and possible subscripts:

SIGIL  IDENTIFIER   SUBSCRIPT

In Perl, you can put whitespace between those parts and it still works, including with the new key/value slice syntax:

@ IDENTIFIER
$ IDENTIFIER [ 0 ]
@ IDENTIFIER [ @indices ]
% IDENTIFIER [ @indices ]

% IDENTIFIER
$ IDENTIFIER { 'cat' }
@ IDENTIFIER { @keys }
% IDENTIFIER { @keys } 

You can replace IDENTIFIER with a reference. This is the circumfix notation, called that because there’s stuff around the thing you’re dereferencing. The general syntax uses braces around the reference, which, like in the previous section, might be something that produces a reference and not a variable:

@ { REFERENCE }
$ { REFERENCE } [ 0 ]
@ { REFERENCE } [ @indices ]
% { REFERENCE } [ @indices ]

% { REFERENCE }
$ { REFERENCE } { 'cat' }
@ { REFERENCE } { @keys }
% { REFERENCE } { @keys } 

If the REFERENCE is a simple scalar variable (not a single element access, a subroutine call, or something else), you can omit the braces:

@ $array_ref
$ $array_ref [ 0 ]
@ $array_ref [ @indices ]
% $array_ref [ @indices ]

% $hash_ref
$ $hash_ref { 'cat' }
@ $hash_ref { @keys }
% $hash_ref { @keys } 

Now, suppose you have an array of arrays of arrays of hashes, just to look at something complicated:

my $array = [
	[
		[
			{ animal => 'cat', name => 'Buster' },
		]
		...
	],
	...
	];

To get the first element of the array reference, you can dereference with the circumfix notation. This gives you the first array reference under the top level reference, leaving off the braces because the top level is in a simple scalar variable:

$$array_ref[0]

To get the first item from that reference, you have to use the braces since that’s a not simple scalar variable:

${ $$array_ref[0] }[0]

To get all the elements from that reference, you add more braces:

@{ ${ $$array_ref[0] }[0] }

It’s the same with the arrow notation (even implied!) which makes the inside only slightly prettier but still needs the outside braces:

@{ $array_ref->[0]->[0] }

You can leave off the arrows between subscripts:

@{ $array_ref->[0][0] }

It’s no wonder some people think Perl is ugly.

The postfix dereferencing looks nicer because it doesn’t use the braces, but it also keeps everything in order from left to right:

$array_ref->[0][0]->@*

You need that extra arrow at the end; it’s a syntax error otherwise:

$array_ref->[0][0]@*        # syntax error!

Subroutine postfix dereference

The postfix dereference for a subroutine is slightly odd because there are different ways that you can call a subroutine, but the postfix dereference uses one of the uncommon ones.

Start with a named subroutine. With just the & and no parentheses, the called subroutine gets the arguments in the current version of @_:

local @_ = qw(Buster Mimi Ginger);

sub some_sub { "@_" }

say "With &: ", &some_sub;     # With &: Buster Mimi Ginger
say "With &(): ", &some_sub(); # With &():

The postfix dereference form has the same behavior. It executes the subroutine reference with the current @_. Dereferencing the sub reference with ->() is different; it specifies an empty argument list:

use v5.20;

use feature qw(postderef);
no warnings qw(experimental::postderef);

my $sub = sub { "@_" };

local @_ = qw(Buster Mimi Ginger);
say $sub->&*;   # Buster Mimi Ginger
say $sub->();   #

Things to remember

  • Perl v5.20 adds a postfix notation to dereference all variable types
  • Subroutines returning references are called in scalar context
  • Subroutine references dereferenced with the postfix notation use the current value of @_

Laufeyjarson writes... » Perl: PBP: 044 Heredoc Quoters

The PBP suggests that all heredocs be explicitly and deliberately quoted.  When I first read this, I didn’t know you could do that!  It’s a great idea.

What this means is to put the right kind of quotes around the name of the marker when it is used.  This tells Perl which kind of quoting, interpolating or not, it should be using for that heredoc:


my $thing = << 'END_THING';

This is something, all right!

It will have $this and $that in it, not blanks and missing variables.

END_THING

&nbsp;

my $msg = << "END_MESSAGE";

This message contains the whole thing:

$thing

END_MESSAGE

The heredoc assigned to $thing is not interpolated.

The heredoc assigned to $msg is.

And, because they’re quoted, you can tell right away by looking.

Pop quiz:  Which is the default, interpolated or not?

Answer: Hell if I know, I always quote them explicitly.

 

Perl Foundation News: The Perl Foundation to increase brand, marketing, and PR

Walnut, CA - With the planning stages of YAPC::NA 2015 (Yet Another Perl Conference) underway, The Perl Foundation has made an increased commitment to marketing and public relations: by teaming up with Pittsburgh based firm ALTRIS Incorporated.

ALTRIS Incorporated, a full-service printing, marketing, and web design firm, specializes in non-profit marketing, fundraising, branding, and event management. "We originally brought in the team at ALTRIS to help with our 2012 and 2013 end-of-the-year reports and sponsorship prospectus," said Dan Wright, Perl Foundation Treasurer, "having a professional marketing and PR team on board is the next step to growing the Foundation's brand, and events."

The Perl Foundation supports four yearly events, including the DC-Baltimore and Pittsburgh Perl Workshops, Perl Oasis, and YAPC::NA. ALTRIS Incorporated will be supporting local organizers and the Foundation's marketing and conferences committees promote their events internationally. The relationship will also include broadcasting news and updates regarding advancements in Perl, sponsorship opportunities, grants, and training materials.

Visit The Perl Foundation online at www.perlfoundation.org and on Facebook at www.fb.com/tpf.perl. Information regarding YAPC::NA 2015 will be available at www.yapcna.org.

PAL-Blog: Leberkäs-Salat

Ich koche gerne und trotzdem finden sich hier eher selten Rezepte. Das heutige Experiment war allerdings so erfolgreich, dass ich es nicht undokumentiert lassen möchte: Eine eigene Variente eines Leberkäse-Salat.

Perl Hacks: Perl’s Problems

It’s been over six weeks since I wrote my blog post on Perl usage. I really didn’t mean to leave it so long to write the follow-up. But real life intervened and I haven’t had time for much blogging. That’s still the case (I should be writing a talk right now) but I thought it was worth jotting down some quick notes about what I think is causing Perl’s decline.

Reputation

We have a lot to thank Matt Wright for. And I don’t mean that sarcastically. A lot of the popularity of Perl in the mid-90s stems directly from people like Matt and Selena Sol making their collections  of CGI programs available really early on. The popularity of their programs made Perl the de-facto standard for CGI programming.

But that was a double-edged sword. People searching the web for examples of CGI programming found Matt or Selena’s code and assumed they represented best practice. Which, of course, they didn’t. While people were blithely copying Matt’s programming style, good Perl programmers were using CGI.pm to parse their incoming parameters and separating their HTML generation out into templates.

In my previous post, I mentioned that fifteen or twenty years ago Perl was the programming language of choice for internet start-ups. That’s true, but a lot of the code written at that time was in the Matt Wright style. Matt’s style just about works for a guestbook or a form mailer. But when you try to build a business on top of code like that, it quickly becomes obvious that it’s an unmaintainable mess.

Many of the technical architects and CTOs who are making decisions about technology in companies today are the programmers who spent too many late nights battling those balls of mud in the 1990s. They were never really Perl programmers, they were only using it because it was fashionable, and they haven’t been keeping up with recent advances in Perl so it’s not surprising that they often choose to avoid using Perl.

Complexity

A lot of Perl’s reputation as executable line noise is completely unwarranted. The people who were writing those 1990s balls of mud were under such pressure to deliver that they would have almost certainly delivered something just as unmaintainable whatever language they were using. But some of that reputation is fair. I’ve been teaching Perl for almost fifteen years and I know that there are some parts of Perl that people find confusing. Here are some examples:

Sigils – I can explain things like @array, $array[$key] and even @array[@keys] to people. And most of them get it. But it takes them a while. And then it all goes to pieces again when I have to explain the difference between $array[$key] and $array->[$key].

Context – Does any other programming language have the concept of context? Yes, when used correctly it’s a powerful tool. But it’s hard to explain and a good source of hard-to-find bugs. Can anyone honestly say that they haven’t been bitten by a context bug at some point in the last years?

Data Structures – Is the difference between arrays and array references really necessary? Think of all the complexity that is added because you can’t just pass arrays and hashes into subroutines without being bitten by list flattening. As experienced Perl programmers we know the problems and our brains are hard-wired to work around it. But other languages treat all aggregate data structures as references and it all becomes a lot easier.

I know that each of these features (and half a dozen other examples I could list) makes Perl a richer and more expressive language. But this comes at the cost of learnability and readability. Perhaps that trade-off once seemed like a good idea. When you’re trying to encourage people to look at your language then the advantages seem less obvious.

Of course, none of these features can be changed as they would break pretty much every existing Perl codebase. Which would be a terrible idea. But you can get away with a lot more breakage when you increase your major version number. Which Perl hasn’t been able to do for fourteen years.

Perl 6

I need to be clear here. I think that Perl 6 looks like a great language. I am really looking forward to using on production systems. And it looks like the current Perl 6 team are doing great work towards making that possible. In fact I think that our best approach to reviving Perl’s fortunes is to get a production-ready version of Perl 6 out and to make a big noise about that.

However, that name has been a big problem.

Looking from outside the Perl echo chamber, it’s easy to believe that Perl hasn’t had a major release for twenty years. And that can probably explain a lot of Perl’s current problems.

I know that people who believe that are wrong. The current version of Perl (5.20.1 as I write this) is a lot different to the version that was current when Perl 6 was first announced (which was 5.6.0, I think). Perl has gone through huge changes in the last fourteen years. But the version number hides that.

I also know that we no longer tell people that Perl 6 is the next version of Perl. The Wikipedia page makes it clear in its first sentence that “Perl 6 is a member of the Perl family of programming languages“. So why do people continue to think it’s the next version of Perl? Well, probably because people assume that they know how software version numbers work and don’t bother to check the web site to see it a particular project has changed the standard meaning that has worked well for decades.

So Perl 6 has been simultaneously both good and bad for Perl. Good because a lot of Perl 6 ideas have been backported into Perl 5. But bad because Perl 5 has been unable to change its major version number in order to advertise these improvements to the wider software-using world.

Nothing can be done about this now. The damage is done. As I said at the start of this section, it’s likely that the only thing we can do is to bet heavily on Perl 6 and get it out as soon as possible. Perl 5 will continue to exist. People will continue to maintain and improve it. Some companies will continue to use it. But it’s usage will continue to fall. I really think it’s too late to do anything about that.

The post Perl’s Problems appeared first on Perl Hacks.

Perl News: Perl and Shellshock

The Shellshock bug affects the Bash shell.

Though Perl it self is not directly affected, some web servers (such as Apache) that run Perl are, additional if Perl shells out: system(), backticks, qx etc, that may then instantiate bash which would be a potential attack vector.

Perl tricks has a good write up.

Your best protection is upgrading Bash on your operating system as soon as patches are available, or switching all accounts to an alternative shell (such as ‘sh’, or ‘zsh’) until there is.

Be aware there were 2 releases of bash to fix this, as the first was not complete.

Perl Foundation News: Grant Report: Modern OO Programming in Perl (Book) - Sept 2014

Toby Inkster reports on his book-writing progress in his latest blog post. Highlights:

  • The material is open and mirrored at GitHub and Bitbucket. He welcomes your comments and suggestions.
  • Work on the namespace chapter is beginning.

I and I'm sure many others are looking forward to having this great resource.

MAJ

Perl News: DWIM Perl for Linux 5.20.1-2 released

DWIM Perl for Linux is a binary Perl distribution including perl and a bunch of CPAN modules. It was created to make it easy to get started with Perl without the need to think about the installation of additional modules. Batteries included.

I have been working on this for quite some time, and I think I finally found a path that will make it easy for me to create future versions of this distribution.

This is the first public release in this new era.

If you have a Linux machine – any Linux machine – it would extremely useful if you could try this and report any problems you might run into.

Ovid: Veure Update

Just in case you're curious, I'm still hacking on Veure, though the last month has kept me busy on a bunch of other things (our daughter just started school, so that's a big one!)

I've been building so much of the infrastructure that you might be surprised to realize that I've only just gotten around to being able to equip weapons and armor:

My last entry gives some hints on how this works.

The other developer has been working on the cockpit view. If you travel from system to system in your own ship, the experience should be different than if you take public shuttles. I haven't actually seen his work yet, so no screenshot on that one.

Update: OK, I have some of the initial screenshots for the cockpit work. They look great, but not sharing until some things are settled.

NEILB: My first Perl Mongers meeting

I attended my first ever Perl Mongers meeting this evening, a Thames Valley PM (TVPM) tech talks session, and presented a rough first draft of a talk I've submitted to the London Perl Workshop. We had two lightning talks then random Perl chitchat at the pub.

Laufeyjarson writes... » Perl: PBP: 043 Heredoc Terminators

Mr. Conway suggests making all heredoc terminators single uppercase identifiers which start with a standard prefix.  This is part of naming things regularly, and I support it.

I think the more senior an engineer you get to be, the more regular your names for things become.  It becomes second nature to end heredocs with END_WHATEVER, and to give WHATEVER a really unique name so that you don’t miss it.  Uniqueness and legibility are important here, so you don’t wonder where the heredoc stops and the program starts again.

Just like I name all my file handles with something_fh (file handle) and format dates as YYY-MM-DDZHH:MM:SS (where ‘Z’ means UTC) this seems a natural thing to do.  I, or whoever encounters these things later, will appreciate them.  Lots of little things add up to easier to work with code.

Perl Foundation News: Outreach Program for Women - Winter 2014 / 2015

I am delighted to announce that the Perl Foundation will once again be taking part in the GNOME Outreach Program for Women.

The Outreach Program for Women (OPW) was started by the GNOME Foundation in 2006 to encourage women to participate in the GNOME project. In the first round eight interns took part working from GNOME. This program has been expanded and in the last round, that took place this summer, forty interns were accepted and seventeen Free and Open Source organisations took part including The Perl Foundation.

We are offering one internship in the winter program which runs from the 9th December 2014 to the 9th March 2015. We have mentors from Dancer and MetaCPAN signed up to provide project ideas and you can read about possible projects on our information page.

We could not take part in this program without the support of our sponsors. In particular I would like to thank Wendy and Liz for donating $1000 to the program.

PAL-Blog: Kinder-Hobbys: Wunsch und Realität

Zu Kindergartenzeiten hängen die Hobbys der Kinder hauptsächlich von der verfügbaren Zeit der Eltern für den notwendigen Taxi-Dienst ab. Mit dem Schulanfang ändert sich vieles, darunter auch die frei verfügbare Zeit und die zur Verfügung stehende kindliche Energie. In der Schule hat sie sich schon gut eingefunden, aber ihr Lebensrythmus hat sich der neuen Situation noch nicht angepasst.

Nestoria Dev Blog: Module of the month September 2014: DBIx::Connector

Welcome to another Module of the Month blog post, a recurring post in which we highlight particular modules, project or tools that we use here at Nestoria.

This month’s award goes to DBIx::Connector, a great tool in the grand UNIX tradition of doing one thing well; in this case, managing a connection to a database.

Like many things in programming, connecting to a database should be something that’s trivially easy - but then the real world intervenes!

Database connections…

  • are often over the network, which is not always reliable
  • occur in forking processes (eg. a forking web server)
  • are an “edge” from your Perl process to the outside world, so encoding becomes important
  • need to handle normal query/response, but also warnings and errors from the database server
  • need to handle different servers as seamlessly as possible

DBIx::Connector does it all!

There’s drivers for MSSQL, Oracle, PostgreSQL, SQLite and MySQL. I can only comment on the MySQL one at the moment, but I have nothing but good things to say about it.

I have used it in heavily forking environments with no troubles, both in web servers but also in basic parallel scripts using Parallel::ForkManager.

One feature I’ve not yet tried, but I’m excited to give a go at some point, is the Connection Modes.

We run all of our queries on the safest mode - “ping” - which causes the database handle to ping and make sure the connection is alive before every request.

Other available modes are “fixup” - where the query is sent with no ping, and then the connection is re-established and the query re-sent on a failure. In most cases this leads to many many fewer pings with no downside, which can be a great performance improvement.

That’s pretty much all I have to say about DBIx::Connector - like I said, it does one thing well. So without further ado let me congratulate David Wheeler and bestow upon him our traditional $1/week (for one year) donation. DWHEELER is a prolific and talented CPAN author and I highly recommend you check out some of his modules, especially DBIx::Connector! Thanks David!

brian d foy: Test::More has lots of crazy new development that's breaking my modules

I still wish we had a way to remove reports from CPAN Testers. The case of a broken Test::More is a really good reason for this.

I received many fail reports for Business::ISBN, which I've been working on lately. However, it's from a test I hadn't touched for things I wasn't working on.

The failure looked odd. I've never heard of Test::More::DeepCheck:

Modification of non-creatable array value attempted, subscript -1 at .../Test/More/DeepCheck.pm line 82.

Then I noticed that all of the fail reports reported the same development version of Test::More:

build_requires:

Module Need Have
-------------------- -------- ------------
ExtUtils::MakeMaker 0 6.99_14
Test::More 0.95 1.301001_045

There are a few other bad tests for things happening in v5.21, but for the most part I get the black eye on MetaCPAN or CPAN Search from the CPAN Testers using a bad version of Test::More. It happens. Schwern once said that he could break all of CPAN.

The problem, though, is not a broken Test::More but a development version of one that we don't trust yet. I know that Andreas and Slaven want to test everything, but perhaps these sorts of tests don't need to go to CPAN Testers. In the last month I've spent a couple hours tracking down problems that weren't anything to do with anything I did and with what I think is some overzealous new development in Test::More. If you want to write a better is_deeply, make a separate module and let people play with it.

Fortunately, normal users won't have this broken Test::More. Unfortunately, they still get to see the red bars for my module.

Laufeyjarson writes... » Perl: PBP: 042 Heredoc Indentation

The Best Practices suggest that putting a heredoc in a deeply nested function looks funny because it has to be left-justified, and suggests creating a “theredoc” by writing a function that does nothing besides evaluate and return the heredoc.

This does solve the indentation problem, but gets more complex if you’re trying to interpolate a bunch of variables into the heredoc.  You wind up sticking them into a hashref and passing the hashref, then using those in the heredoc.  It makes for quite a complex function, and is not as simple and easy to call as the sample in the book.

It also still leaves the heredoc at the left margin.

The Perl Cookbook, in recipie 1.11, “Indenting Here Documents” suggests using a regular expression to eat all the leading spaces from the heredoc.  You get the heredoc where it has to go and can indent it properly.

I don’t usually bother with either.  Considering that the default perltidy settings will try and avoid breaking long strings by reverse-indenting them as far as needed, it is not unusual to wind up with long strings pushed back to the left margin anyway.  I just stick the heredoc in where it needs to be and am happy.

As I mentioned before, a potentially better solution is to take the literal strings out of the program and put them in data files, fit for internationalization.  But I haven’t done it and can’t tell you the pros and cons.

Perl News: Perl 5.20.1 is now available

Perl 5.20.1 has been released, this is the latest stable version of Perl.

Changes include performance enhancements and various bug fixes.

Perl 5.20.1 represents approximately 4 months of development since Perl 5.20.0 and contains approximately 12,000 lines of changes across 170 files from 36 authors.

NEILB: What is "the Perl community"?

Prompted by RIBASUSHI's blog post, several discussions, github issue threads, and a long IRC chat with MITHALDU (pro tip: don't get him started! :-), I've been thinking about Perl community. I realised that one of the things I was reacting to was the suggestion that for Perl, "IRC == community". So, what is "the perl community"?

Perl Hacks: “I Do Not Want To Use Any Modules”

Almost every day on the Perl groups on LinkedIn (or Facebook, or StackOverflow, or somewhere like that) I see a question that includes the restriction “I do not want to use any modules”.

There was one on LinkedIn yesterday. He wanted to create a MIME message to pass to sendmail, but he didn’t want to install any modules. Because “getting a module installed will have to go though a long long process of approvals”.

And I understand that. I really do. We’ve all seen places where getting new software installed is a problem. But I see that problem as a bug in the development process. A bug that needs to be fixed before anything can get done in a reasonable manner. Here’s what I’ve just written in reply:

Of course it can be achieved without modules. Just create an email in the correct format and pass it to sendmail.

Ah, but what’s the right format? Well, that is (of course) the tricky bit. I have no idea what the correct format is. Oh, I could Google a bit and come up with some ideas. I might even find the RFC that defines the MIME format. And then I’d be able to knock up some code that created something that looked like it would work. But would I be sure that it works? In every case? With all the weird corner-cases that people might throw at it?

This is where CPAN modules come in handy. You’re using someone else’s knowledge. Someone who is (hopefully) an expert in the field. And because modules are used by lots of people, bugs get found and fixed.

A lot of modern Perl programming is about choosing the right set of CPAN modules and plumbing them together. That’s what makes Perl so powerful. That’s what makes Perl programmers so efficient. We’re standing on the shoulders of giants and re-using other people’s code.

If you’re not going to use CPAN then you might as well use shell-scripting or awk.

If you’re in a situation where getting CPAN modules installed is hard, then fixing that problem should be your first priority. Because that’s a big impediment to your Perl programming. And investing time in fixing that will be massively beneficial to you in a very short amount of time.

The obvious solution is to install your own module tree (alongside your own Perl) as part of your application. But that might be overkill in some situations, so you could also consider using the system Perl and asking your sysadmin to install packages from your distribution’s repositories. Of course, that might need a change in process. But it’s a change that is well worth making; a change that will improve your (programming) life immensely.

Update: Some very interesting discussion about this over on Reddit.

The post “I Do Not Want To Use Any Modules” appeared first on Perl Hacks.

PAL-Blog: Zoe's erster Schultag

Zugegeben, der erste offizielle Schultag war die Einschulung, aber am Montag folgte dann der erste in voller Länge und es gab die ersten Hausaufgaben - oder doch nicht?

Dave's Free Press: Journal: Devel::CheckLib can now check libraries' contents

Perlgeek.de : Rakudo's Abstract Syntax Tree

After or while a compiler parses a program, the compiler usually translates the source code into a tree format called Abstract Syntax Tree, or AST for short.

The optimizer works on this program representation, and then the code generation stage turns it into a format that the platform underneath it can understand. Actually I wanted to write about the optimizer, but noticed that understanding the AST is crucial to understanding the optimizer, so let's talk about the AST first.

The Rakudo Perl 6 Compiler uses an AST format called QAST. QAST nodes derive from the common superclass QAST::Node, which sets up the basic structure of all QAST classes. Each QAST node has a list of child nodes, possibly a hash map for unstructured annotations, an attribute (confusingly) named node for storing the lower-level parse tree (which is used to extract line numbers and context), and a bit of extra infrastructure.

The most important node classes are the following:

QAST::Stmts
A list of statements. Each child of the node is considered a separate statement.
QAST::Op
A single operation that usually maps to a primitive operation of the underlying platform, like adding two integers, or calling a routine.
QAST::IVal, QAST::NVal, QAST::SVal
Those hold integer, float ("numeric") and string constants respectively.
QAST::WVal
Holds a reference to a more complex object (for example a class) which is serialized separately.
QAST::Block
A list of statements that introduces a separate lexical scope.
QAST::Var
A variable
QAST::Want
A node that can evaluate to different child nodes, depending on the context it is compiled it.

To give you a bit of a feel of how those node types interact, I want to give a few examples of Perl 6 examples, and what AST they could produce. (It turns out that Perl 6 is quite a complex language under the hood, and usually produces a more complicated AST than the obvious one; I'll ignore that for now, in order to introduce you to the basics.)

Ops and Constants

The expression 23 + 42 could, in the simplest case, produce this AST:

QAST::Op.new(
    :op('add'),
    QAST::IVal.new(:value(23)),
    QAST::IVal.new(:value(42)),
);

Here an QAST::Op encodes a primitive operation, an addition of two numbers. The :op argument specifies which operation to use. The child nodes are two constants, both of type QAST::IVal, which hold the operands of the low-level operation add.

Now the low-level add operation is not polymorphic, it always adds two floating-point values, and the result is a floating-point value again. Since the arguments are integers and not floating point values, they are automatically converted to float first. That's not the desired semantics for Perl 6; actually the operator + is implemented as a subroutine of name &infix:<+>, so the real generated code is closer to

QAST::Op.new(
    :op('call'),
    :name('&infix:<+>'),    # name of the subroutine to call
    QAST::IVal.new(:value(23)),
    QAST::IVal.new(:value(42)),
);

Variables and Blocks

Using a variable is as simple as writing QAST::Var.new(:name('name-of-the-variable')), but it must be declared first. This is done with QAST::Var.new(:name('name-of-the-variable'), :decl('var'), :scope('lexical')).

But there is a slight caveat: in Perl 6 a variable is always scoped to a block. So while you can't ordinarily mention a variable prior to its declaration, there are indirect ways to achieve that (lookup by name, and eval(), to name just two).

So in Rakudo there is a convention to create QAST::Block nodes with two QAST::Stmts children. The first holds all the declarations, and the second all the actual code. That way all the declaration always come before the rest of the code.

So my $x = 42; say $x compiles to roughly this:

QAST::Block.new(
    QAST::Stmts.new(
        QAST::Var.new(:name('$x'), :decl('var'), :scope('lexical')),
    ),
    QAST::Stmts.new(
        QAST::Op.new(
            :op('p6store'),
            QAST::Var.new(:name('$x')),
            QAST::IVal.new(:value(42)),
        ),
        QAST::Op.new(
            :op('call'),
            :name('&say'),
            QAST::Var.new(:name('$x')),
        ),
    ),
);

Polymorphism and QAST::Want

Perl 6 distinguishes between native types and reference types. Native types are closer to the machine, and their type name is always lower case in Perl 6.

Integer literals are polymorphic in that they can be either a native int or a "boxed" reference type Int.

To model this in the AST, QAST::Want nodes can contain multiple child nodes. The compile-time context decides which of those is acutally used.

So the integer literal 42 actually produces not just a simple QAST::IVal node but rather this:

QAST::Want.new(
    QAST::WVal(Int.new(42)),
    'Ii',
    QAST::Ival(42),
)

(Note that Int.new(42) is just a nice notation to indicate a boxed integer object; it doesn't quite work like this in the code that translate Perl 6 source code into ASTs).

The first child of a QAST::Want node is the one used by default, if no other alternative matches. The comes a list where the elements with odd indexes are format specifications (here Ii for integers) and the elements at even-side indexes are the AST to use in that case.

An interesting format specification is 'v' for void context, which is always chosen when the return value from the current expression isn't used at all. In Perl 6 this is used to eagerly evaluate lazy lists that are used in void context, and for several optimizations.

Dave's Free Press: Journal: I Love Github

Dave's Free Press: Journal: Palm Treo call db module

Ocean of Awareness: Evolvable languages

Ideally, if a syntax is useful and clear, and a programmer can easily read it at a glance, you should be able to add it to an existing language. In this post, I will describe a modest incremental change to the Perl syntax.

It's one I like, because that's beside the point, for two reasons. First, it's simply intended as an example of language evolution. Second, regardless of its merits, it is unlikely to happen, because of the way that Perl 5 is parsed. In this post I will demonstrate a way of writing a parser, so that this change, or others, can be made in a straightforward way, and without designing your language into a corner.

When initializing a hash, Perl 5 allows you to use not just commas, but also the so-called "wide comma" (=>). The wide comma is suggestive visually, and it also has some smarts about what a hash key is: The hash key is always converted into a string, so that wide comma knows that in a key-value pair like this:

    key1 => 711,

that key1 is intended as a string.

But what about something like this?

  {
   company name => 'Kamamaya Technology',
   employee 1 => first name => 'Jane',
   employee 1 => last name => 'Doe',
   employee 1 => title => 'President',
   employee 2 => first name => 'John',
   employee 2 => last name => 'Smith',
   employee 3 => first name => 'Clarence',
   employee 3 => last name => 'Darrow',
  }

Here I think the intent is obvious -- to create an employee database in the form of a hash of hashes, allowing spaces in the keys. In Data::Dumper format, the result would look like:

{
              'employee 2' => {
                                'last name' => '\'Smith\'',
                                'first name' => '\'John\''
                              },
              'company name' => '\'Kamamaya Technology\'',
              'employee 3' => {
                                'last name' => '\'Darrow\'',
                                'first name' => '\'Clarence\''
                              },
              'employee 1' => {
                                'title' => '\'President\'',
                                'last name' => '\'Doe\'',
                                'first name' => '\'Jane\''
                              }
            }

And in fact, that is the output of the script in this Github gist, which parses the previous "extended Perl 5" snippet using a Marpa grammar before passing it on to Perl.

Perl 5 does not allow a syntax like this, and looking at its parsing code will tell you why -- it's already a maintenance nightmare. The extension I've described above could, in theory, be added to Perl 5, but doing so would aggravate an already desperate maintenance situation.

Now, depending on taste, you may be just as happy that you'll never see the extensions I have just outlined in Perl 5. But I don't think it is as easy to be happy about a parsing technology that quickly paints the languages which use it into a corner.

How it works

The code is in a Github gist. For the purposes of the example, I've implemented a toy subset of Perl. But this approach has been shown to scale. There are full Marpa-powered parsers of C, ECMAScript, XPath, and liberal HTML.

Marpa is a general BNF parser, which means that anything you can write in BNF, Marpa can parse. For practical parsing, what matters are those grammars that can be parsed in linear time, and with Marpa that class is vast, including all the classes of grammar currently in practical use. To describe the class of grammars that Marpa parses in linear time, assume that you have either a left or right parser, with infinite lookahead, that uses regular expressions. (A parser like this is called LR-regular.) Assume that this LR-regular parser parses your grammar. In that case, you can be sure that Marpa will parse that grammar in linear time, and without doing the lookahead. (Instead Marpa tracks possibilities in a highly-optimized table.) Marpa also parses many grammars that are not LR-regular in linear time, but just LR-regular is very likely to include any class of grammar that you will be interested in parsing. The LR-regular grammars easily include all those that can be parsed using yacc, recursive descent or regular expressions.

Marpa excels at those special hacks so necessary in recursive descent and other techniques. Marpa allows you to define events that will stop it at symbols or rules, both before and after. While stopped, you can hand processing over to your own custom code. Your custom code can feed your own tokens to the parse for as long as you like. In doing so, it can consult Marpa to determine exactly what symbols and rules have been recognized and which ones are expected. Once finished with custom processing, you can then ask Marpa to pick up again at any point you wish.

The craps game is over

The bottom line is that if you can describe your language extension in BNF, or in BNF plus some hacks, you can rely on Marpa parsing it in reasonable time. Language design has been like shooting crap in a casino that sets you up to win a lot of the first rolls before the laws of probability grind you down. Marpa changes the game.

To learn more

Marpa::R2 is available on CPAN. A list of my Marpa tutorials can be found here. There are new tutorials by Peter Stuifzand and amon. The Ocean of Awareness blog focuses on Marpa, and it has an annotated guide. Marpa has a web page that I maintain and Ron Savage maintains another. For questions, support and discussion, there is the "marpa parser" Google Group.

Comments

Comments on this post can be made in Marpa's Google group.

Perlgeek.de : doc.perl6.org and p6doc

Background

Earlier this year I tried to assess the readiness of the Perl 6 language, compilers, modules, documentation and so on. While I never got around to publish my findings, one thing was painfully obvious: there is a huge gap in the area of documentation.

There are quite a few resources, but none of them comprehensive (most comprehensive are the synopsis, but they are not meant for the end user), and no single location we can point people to.

Announcement

So, in the spirit of xkcd, I present yet another incomplete documentation project: doc.perl6.org and p6doc.

The idea is to take the same approach as perldoc for Perl 5: create user-level documentation in Pod format (here the Perl 6 Pod), and make it available both on a website and via a command line tool. The source (documentation, command line tool, HTML generator) lives at https://github.com/perl6/doc/. The website is doc.perl6.org.

Oh, and the last Rakudo Star release (2012.06) already shipped p6doc.

Status and Plans

Documentation, website and command line tool are all in very early stages of development.

In the future, I want both p6doc SOMETHING and http://doc.perl6.org/SOMETHING to either document or link to documentation of SOMETHING, be it a built-in variable, an operator, a type name, routine name, phaser, constant or... all the other possible constructs that occur in Perl 6. URLs and command line arguments specific to each type of construct will also be available (/type/SOMETHING URLs already work).

Finally I want some way to get a "full" view of a type, ie providing all methods from superclasses and roles too.

Help Wanted

All of that is going to be a lot of work, though the most work will be to write the documentation. You too can help! You can write new documentation, gather and incorporate already existing documentation with compatible licenses (for example synopsis, perl 6 advent calendar, examples from rosettacode), add more examples, proof-read the documentation or improve the HTML generation or the command line tool.

If you have any questions about contributing, feel free to ask in #perl6. Of course you can also; create pull requests right away :-).

Ocean of Awareness: Language design: Exploiting ambiguity

Currently, in designing languages, we don't allow ambiguities -- not even potential ones. We insist that it must not be even possible to write an ambiguous program. This is unnecessarily restrictive.

This post is written in English, which is full of ambiguities. Natural languages are always ambiguous, because human beings find that that's best way for versatile, rapid, easy communication. Human beings arrange things so that every sentence is unambiguous in context. Mistakes happen, and ambiguous sentences occur, but in practice, the problem is manageable. In a conversation, for example, we would just ask for clarification.

If we allow our computer languages to take their most natural forms, they will often have the potential for ambiguity. This is even less of a problem on a computer than it is in conversation -- a computer can always spot an actual ambiguity immediately. When actual ambiguities occur, we can deal with them in exactly the same way that we deal with any other syntax problem: The computer catches it and reports it, and we fix it.

An example

To illustrate, I'll use a DSL-writing DSL language. It'll be tiny -- just lexeme declarations and BNF rules. Newlines will not be significant. Statements can end with a semicolon, but that's optional. (The code for this post is in a Github gist.)

Here is a toy calculator written in our tiny DSL-writing language:

  Number matches '\d+'
  E ::= T '*' F
  E ::= T
  T ::= F '+' Number
  T ::= Number

Trying an improvement

With a grammar this small, just about anything is readable. But let's assume we want to improve it, and that we decide that the lexeme declaration of Number really belongs after the rules which use it. (If our grammar was longer, this could make a real difference.) So we move the lexeme declaration to the end:

  E ::= T '*' F
  E ::= T
  T ::= F '+' Number
  T ::= Number
  Number matches '\d+'

But there's an issue

It turns out the grammar for our toy DSL-writer is ambiguous. When a lexeme declaration follows a BNF rule, there's no way to tell whether or not it is actually a lexeme declaration, or part of the BNF rule. Our parser catches that:

Parse of BNF/Scanless source is ambiguous
Length of symbol "Statement" at line 4, column 1 is ambiguous
  Choices start with: T ::= Number
  Choice 1, length=12, ends at line 4, column 12
  Choice 1: T ::= Number
  Choice 2, length=33, ends at line 5, column 20
  Choice 2: T ::= Number\nNumber matches '\\d

Here Marpa tells you why it thinks your script is ambiguous. Two different statements can start at line 4. Both of them are BNF rules, but one is longer than the other.

Just another syntax error

Instead of having to design a language where ambiguity was not even possible, we designed one where ambiguities can happen. This allows us to design a much more flexible language, like the ones we choose when we humans communicate with each other. The downside is that actual ambiguities will occur, but they can be reported, and fixed, just like any other syntax error.

In this case, we recall we allowed semi-colons to terminate a rule, and our fix is easy:

  E ::= T '*' F
  E ::= T
  T ::= F '+' Number
  T ::= Number ;
  Number matches '\d+'

To learn more

The code for this post is a gist on Github. It was written using Marpa::R2, which is available on CPAN. A list of my Marpa tutorials can be found here. There are new tutorials by Peter Stuifzand and amon. The Ocean of Awareness blog focuses on Marpa, and it has an annotated guide. Marpa has a web page that I maintain and Ron Savage maintains another. For questions, support and discussion, there is a "marpa parser" Google Group and an IRC channel: #marpa at irc.freenode.net.

Comments

Comments on this post can be made in Marpa's Google group.

Dave's Free Press: Journal: Graphing tool

Dave's Free Press: Journal: XML::Tiny released

Perlgeek.de : Pattern Matching and Unpacking

When talking about pattern matching in the context of Perl 6, people usually think about regex or grammars. Those are indeed very powerful tools for pattern matching, but not the only one.

Another powerful tool for pattern matching and for unpacking data structures uses signatures.

Signatures are "just" argument lists:

sub repeat(Str $s, Int $count) {
    #     ^^^^^^^^^^^^^^^^^^^^  the signature
    # $s and $count are the parameters
    return $s x $count
}

Nearly all modern programming languages have signatures, so you might say: nothing special, move along. But there are two features that make them more useful than signatures in other languages.

The first is multi dispatch, which allows you to write several routines with the name, but with different signatures. While extremely powerful and helpful, I don't want to dwell on them. Look at Chapter 6 of the "Using Perl 6" book for more details.

The second feature is sub-signatures. It allows you to write a signature for a sigle parameter.

Which sounds pretty boring at first, but for example it allows you to do declarative validation of data structures. Perl 6 has no built-in type for an array where each slot must be of a specific but different type. But you can still check for that in a sub-signature

sub f(@array [Int, Str]) {
    say @array.join: ', ';
}
f [42, 'str'];      # 42, str
f [42, 23];         # Nominal type check failed for parameter '';
                    # expected Str but got Int instead in sub-signature
                    # of parameter @array

Here we have a parameter called @array, and it is followed by a square brackets, which introduce a sub-signature for an array. When calling the function, the array is checked against the signature (Int, Str), and so if the array doesn't contain of exactly one Int and one Str in this order, a type error is thrown.

The same mechanism can be used not only for validation, but also for unpacking, which means extracting some parts of the data structure. This simply works by using variables in the inner signature:

sub head(*@ [$head, *@]) {
    $head;
}
sub tail(*@ [$, *@tail]) {
    @tail;
}
say head <a b c >;      # a
say tail <a b c >;      # b c

Here the outer parameter is anonymous (the @), though it's entirely possible to use variables for both the inner and the outer parameter.

The anonymous parameter can even be omitted, and you can write sub tail( [$, *@tail] ) directly.

Sub-signatures are not limited to arrays. For working on arbitrary objects, you surround them with parenthesis instead of brackets, and use named parameters inside:

multi key-type ($ (Numeric :$key, *%)) { "Number" }
multi key-type ($ (Str     :$key, *%)) { "String" }
for (42 => 'a', 'b' => 42) -> $pair {
    say key-type $pair;
}
# Output:
# Number
# String

This works because the => constructs a Pair, which has a key and a value attribute. The named parameter :$key in the sub-signature extracts the attribute key.

You can build quite impressive things with this feature, for example red-black tree balancing based on multi dispatch and signature unpacking. (More verbose explanation of the code.) Most use cases aren't this impressive, but still it is very useful to have occasionally. Like for this small evaluator.

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 2

Perlgeek.de : YAPC Europe 2013 Day 3

The second day of YAPC Europe climaxed in the river boat cruise, Kiev's version of the traditional conference dinner. It was a largish boat traveling on the Dnipro river, with food, drinks and lots of Perl folks. Not having fixed tables, and having to get up to fetch food and drinks led to a lot of circulation, and thus meeting many more people than at traditionally dinners. I loved it.

Day 3 started with a video message from next year's YAPC Europe organizers, advertising for the upcoming conference and talking a bit about the oppurtunities that Sofia offers. Tempting :-).

Monitoring with Perl and Unix::Statgrab was more about the metrics that are available for monitoring, and less about doing stuff with Perl. I was a bit disappointed.

The "Future Perl Versioning" Discussion was a very civilized discussion, with solid arguments. Whether anybody changed their minds remain to be seen.

Carl Mäsak gave two great talks: one on reactive programming, and one on regular expressions. I learned quite a bit in the first one, and simply enjoyed the second one.

After the lunch (tasty again), I attended Jonathan Worthington's third talk, MoarVM: a metamodel-focused runtime for NQP and Rakudo. Again this was a great talk, based on great work done by Jonathan and others during the last 12 months or so. MoarVM is a virtual machine designed for Perl 6's needs, as we understand them now (as opposed to parrot, which was designed towards Perl 6 as it was understood around 2003 or so, which is considerably different).

How to speak manager was both amusing and offered a nice perspective on interactions between managers and programmers. Some of this advice assumed a non-tech-savy manager, and thus didn't quite apply to my current work situation, but was still interesting.

I must confess I don't remember too much of the rest of the talks that evening. I blame five days of traveling, hackathon and conference taking their toll on me.

The third session of lightning talks was again an interesting mix, containing interesting technical tidbits, the usual "we are hiring" slogans, some touching and thoughtful moments, and finally a song by Piers Cawley. He had written the lyrics in the previous 18 hours (including sleep), to (afaict) a traditional irish song. Standing up in front of ~300 people and singing a song that you haven't really had time to practise takes a huge amount of courage, and I admire Piers both for his courage and his great performance. I hope it was recorded, and makes it way to the public soon.

Finally the organizers spoke some closing words, and received their well-deserved share of applause.

As you might have guess from this and the previous blog posts, I enjoyed this year's YAPC Europe very much, and found it well worth attending, and well organized. I'd like to give my heart-felt thanks to everybody who helped to make it happen, and to my employer for sending me there.

This being only my second YAPC, I can't make any far-reaching comparisons, but compared to YAPC::EU 2010 in Pisa I had an easier time making acquaintances. I cannot tell what the big difference was, but the buffet-style dinners at the pre-conference meeting and the river boat cruise certainly helped to increase the circulation and thus the number of people I talked to.

Dave's Free Press: Journal: YAPC::Europe 2007 travel plans

Perlgeek.de : A small regex optimization for NQP and Rakudo

Recently I read the course material of the Rakudo and NQP Internals Workshop, and had an idea for a small optimization for the regex engine. Yesterday night I implemented it, and I'd like to walk you through the process.

As a bit of background, the regex engine that Rakudo uses is actually implemented in NQP, and used by NQP too. The code I am about to discuss all lives in the NQP repository, but Rakudo profits from it too.

In addition one should note that the regex engine is mostly used for parsing grammar, a process which involves nearly no scanning. Scanning is the process where the regex engine first tries to match the regex at the start of the string, and if it fails there, moves to the second character in the string, tries again etc. until it succeeds.

But regexes that users write often involve scanning, and so my idea was to speed up regexes that scan, and where the first thing in the regex is a literal. In this case it makes sense to find possible start positions with a fast string search algorithm, for example the Boyer-Moore algorithm. The virtual machine backends for NQP already implement that as the index opcode, which can be invoked as start = index haystack, needle, startpos, where the string haystack is searched for the substring needle, starting from position startpos.

From reading the course material I knew I had to search for a regex type called scan, so that's what I did:

$ git grep --word scan
3rdparty/libtommath/bn_error.c:   /* scan the lookup table for the given message
3rdparty/libtommath/bn_mp_cnt_lsb.c:   /* scan lower digits until non-zero */
3rdparty/libtommath/bn_mp_cnt_lsb.c:   /* now scan this digit until a 1 is found
3rdparty/libtommath/bn_mp_prime_next_prime.c:                   /* scan upwards 
3rdparty/libtommath/changes.txt:       -- Started the Depends framework, wrote d
src/QRegex/P5Regex/Actions.nqp:                     QAST::Regex.new( :rxtype<sca
src/QRegex/P6Regex/Actions.nqp:                     QAST::Regex.new( :rxtype<sca
src/vm/jvm/QAST/Compiler.nqp:    method scan($node) {
src/vm/moar/QAST/QASTRegexCompilerMAST.nqp:    method scan($node) {
Binary file src/vm/moar/stage0/NQPP6QRegexMoar.moarvm matches
Binary file src/vm/moar/stage0/QASTMoar.moarvm matches
src/vm/parrot/QAST/Compiler.nqp:    method scan($node) {
src/vm/parrot/stage0/P6QRegex-s0.pir:    $P5025 = $P5024."new"("scan" :named("rx
src/vm/parrot/stage0/QAST-s0.pir:.sub "scan" :subid("cuid_135_1381944260.6802") 
src/vm/parrot/stage0/QAST-s0.pir:    push $P5004, "scan"

The binary files and .pir files are generated code included just for bootstrapping, and not interesting for us. The files in 3rdparty/libtommath are there for bigint handling, thus not interesting for us either. The rest are good matches: src/QRegex/P6Regex/Actions.nqp is responsible for compiling Perl 6 regexes to an abstract syntax tree (AST), and src/vm/parrot/QAST/Compiler.nqp compiles that AST down to PIR, the assembly language that the Parrot Virtual Machine understands.

So, looking at src/QRegex/P6Regex/Actions.nqp the place that mentions scan looked like this:

    $block<orig_qast> := $qast;
    $qast := QAST::Regex.new( :rxtype<concat>,
                 QAST::Regex.new( :rxtype<scan> ),
                 $qast,
                 ($anon
                      ?? QAST::Regex.new( :rxtype<pass> )
                      !! (nqp::substr(%*RX<name>, 0, 12) ne '!!LATENAME!!'
                            ?? QAST::Regex.new( :rxtype<pass>, :name(%*RX<name>) )
                            !! QAST::Regex.new( :rxtype<pass>,
                                   QAST::Var.new(
                                       :name(nqp::substr(%*RX<name>, 12)),
                                       :scope('lexical')
                                   ) 
                               )
                          )));

So to make the regex scan, the AST (in $qast) is wrapped in QAST::Regex.new(:rxtype<concat>,QAST::Regex.new( :rxtype<scan> ), $qast, ...), plus some stuff I don't care about.

To make the optimization work, the scan node needs to know what to scan for, if the first thing in the regex is indeed a constant string, aka literal. If it is, $qast is either directly of rxtype literal, or a concat node where the first child is a literal. As a patch, it looks like this:

--- a/src/QRegex/P6Regex/Actions.nqp
+++ b/src/QRegex/P6Regex/Actions.nqp
@@ -667,9 +667,21 @@ class QRegex::P6Regex::Actions is HLL::Actions {
     self.store_regex_nfa($code_obj, $block, QRegex::NFA.new.addnode($qast))
     self.alt_nfas($code_obj, $block, $qast);
 
+    my $scan := QAST::Regex.new( :rxtype<scan> );
+    {
+        my $q := $qast;
+        if $q.rxtype eq 'concat' && $q[0] {
+            $q := $q[0]
+        }
+        if $q.rxtype eq 'literal' {
+            nqp::push($scan, $q[0]);
+            $scan.subtype($q.subtype);
+        }
+    }
+
     $block<orig_qast> := $qast;
     $qast := QAST::Regex.new( :rxtype<concat>,
-                 QAST::Regex.new( :rxtype<scan> ),
+                 $scan,
                  $qast,

Since concat nodes have always been empty so far, the code generators don't look at their child nodes, and adding one with nqp::push($scan, $q[0]); won't break anything on backends that don't support this optimization yet (which after just this patch were all of them). Running make test confirmed that.

My original patch did not contain the line $scan.subtype($q.subtype);, and later on some unit tests started to fail, because regex matches can be case insensitive, but the index op works only case sensitive. For case insensitive matches, the $q.subtype of the literal regex node would be ignorecase, so that information needs to be carried on to the code generation backend.

Once that part was in place, and some debug nqp::say() statements confirmed that it indeed worked, it was time to look at the code generation. For the parrot backend, it looked like this:

    method scan($node) {
        my $ops := self.post_new('Ops', :result(%*REG<cur>));
        my $prefix := self.unique('rxscan');
        my $looplabel := self.post_new('Label', :name($prefix ~ '_loop'));
        my $scanlabel := self.post_new('Label', :name($prefix ~ '_scan'));
        my $donelabel := self.post_new('Label', :name($prefix ~ '_done'));
        $ops.push_pirop('repr_get_attr_int', '$I11', 'self', %*REG<curclass>, '"$!from"');
        $ops.push_pirop('ne', '$I11', -1, $donelabel);
        $ops.push_pirop('goto', $scanlabel);
        $ops.push($looplabel);
        $ops.push_pirop('inc', %*REG<pos>);
        $ops.push_pirop('gt', %*REG<pos>, %*REG<eos>, %*REG<fail>);
        $ops.push_pirop('repr_bind_attr_int', %*REG<cur>, %*REG<curclass>, '"$!from"', %*REG<pos>);
        $ops.push($scanlabel);
        self.regex_mark($ops, $looplabel, %*REG<pos>, 0);
        $ops.push($donelabel);
        $ops;
    }

While a bit intimidating at first, staring at it for a while quickly made clear what kind of code it emits. First three labels are generated, to which the code can jump with goto $label: One as a jump target for the loop that increments the cursor position ($looplabel), one for doing the regex match at that position ($scanlabel), and $donelabel for jumping to when the whole thing has finished.

Inside the loop there is an increment (inc) of the register the holds the current position (%*REG<pos>), that position is compared to the end-of-string position (%*REG<eos>), and if is larger, the cursor is marked as failed.

So the idea is to advance the position by one, and then instead of doing the regex match immediately, call the index op to find the next position where the regex might succeed:

--- a/src/vm/parrot/QAST/Compiler.nqp
+++ b/src/vm/parrot/QAST/Compiler.nqp
@@ -1564,7 +1564,13 @@ class QAST::Compiler is HLL::Compiler {
         $ops.push_pirop('goto', $scanlabel);
         $ops.push($looplabel);
         $ops.push_pirop('inc', %*REG<pos>);
-        $ops.push_pirop('gt', %*REG<pos>, %*REG<eos>, %*REG<fail>);
+        if nqp::elems($node.list) && $node.subtype ne 'ignorecase' {
+            $ops.push_pirop('index', %*REG<pos>, %*REG<tgt>, self.rxescape($node[0]), %*REG<pos>);
+            $ops.push_pirop('eq', %*REG<pos>, -1, %*REG<fail>);
+        }
+        else {
+            $ops.push_pirop('gt', %*REG<pos>, %*REG<eos>, %*REG<fail>);
+        }
         $ops.push_pirop('repr_bind_attr_int', %*REG<cur>, %*REG<curclass>, '"$!from"', %*REG<pos>);
         $ops.push($scanlabel);
         self.regex_mark($ops, $looplabel, %*REG<pos>, 0);

The index op returns -1 on failure, so the condition for a cursor fail are slightly different than before.

And as mentioned earlier, the optimization can only be safely done for matches that don't ignore case. Maybe with some additional effort that could be remedied, but it's not as simple as case-folding the target string, because some case folding operations can change the string length (for example ß becomes SS while uppercasing).

After successfully testing the patch, I came up with a small, artifical benchmark designed to show a difference in performance for this particular case. And indeed, it sped it up from 647 ± 28 µs to 161 ± 18 µs, which is roughly a factor of four.

You can see the whole thing as two commits on github.

What remains to do is implementing the same optimization on the JVM and MoarVM backends, and of course other optimizations. For example the Perl 5 regex engine keeps track of minimal and maximal string lengths for each subregex, and can anchor a regex like /a?b?longliteral/ to 0..2 characters before a match of longliteral, and generally use that meta information to fail faster.

But for now I am mostly encouraged that doing a worthwhile optimization was possible in a single evening without any black magic, or too intimate knowledge of the code generation.

Update: the code generation for MoarVM now also uses the index op. The logic is the same as for the parrot backend, the only difference is that the literal needs to be loaded into a register (whose name fresh_s returns) before index_s can use it.

Perlgeek.de : Quo Vadis Perl?

The last two days we had a gathering in town named Perl (yes, a place with that name exists). It's a lovely little town next to the borders to France and Luxembourg, and our meeting was titled "Perl Reunification Summit".

Sadly I only managed to arrive in Perl on Friday late in the night, so I missed the first day. Still it was totally worth it.

We tried to answer the question of how to make the Perl 5 and the Perl 6 community converge on a social level. While we haven't found the one true answer to that, we did find that discussing the future together, both on a technical and on a social level, already brought us closer together.

It was quite a touching moment when Merijn "Tux" Brand explained that he was skeptic of Perl 6 before the summit, and now sees it as the future.

We also concluded that copying API design is a good way to converge on a technical level. For example Perl 6's IO subsystem is in desperate need of a cohesive design. However none of the Perl 6 specification and the Rakudo development team has much experience in that area, and copying from successful Perl 5 modules is a viable approach here. Path::Class and IO::All (excluding the crazy parts) were mentioned as targets worth looking at.

There is now also an IRC channel to continue our discussions -- join #p6p5 on irc.perl.org if you are interested.

We also discussed ways to bring parallel programming to both perls. I missed most of the discussion, but did hear that one approach is to make easier to send other processes some serialized objects, and thus distribute work among several cores.

Patrick Michaud gave a short ad-hoc presentation on implicit parallelism in Perl 6. There are several constructs where the language allows parallel execution, for example for Hyper operators, junctions and feeds (think of feeds as UNIX pipes, but ones that allow passing of objects and not just strings). Rakudo doesn't implement any of them in parallel right now, because the Parrot Virtual Machine does not provide the necessary primitives yet.

Besides the "official" program, everybody used the time in meat space to discuss their favorite projects with everybody else. For example I took some time to discuss the future of doc.perl6.org with Patrick and Gabor Szabgab, and the relation to perl6maven with the latter. The Rakudo team (which was nearly completely present) also discussed several topics, and I was happy to talk about the relation between Rakudo and Parrot with Reini Urban.

Prior to the summit my expectations were quite vague. That's why it's hard for me to tell if we achieved what we and the organizers wanted. Time will tell, and we want to summarize the result in six to nine months. But I am certain that many participants have changed some of their views in positive ways, and left the summit with a warm, fuzzy feeling.

I am very grateful to have been invited to such a meeting, and enjoyed it greatly. Our host and organizers, Liz and Wendy, took care of all of our needs -- travel, food, drinks, space, wifi, accommodation, more food, entertainment, food for thought, you name it. Thank you very much!

Update: Follow the #p6p5 hash tag on twitter if you want to read more, I'm sure other participants will blog too.

Other blogs posts on this topic: PRS2012 – Perl5-Perl6 Reunification Summit by mdk and post-yapc by theorbtwo

Dave's Free Press: Journal: Wikipedia handheld proxy

Dave's Free Press: Journal: Bryar security hole

Dave's Free Press: Journal: Thankyou, Anonymous Benefactor!

Dave's Free Press: Journal: Number::Phone release

Dave's Free Press: Journal: Ill

Dave's Free Press: Journal: CPANdeps upgrade

Perlgeek.de : iPod nano 5g on linux -- works!

For Christmas I got an iPod nano (5th generation). Since I use only Linux on my home computers, I searched the Internet for how well it is supported by Linux-based tools. The results looked bleak, but they were mostly from 2009.

Now (December 2012) on my Debian/Wheezy system, it just worked.

The iPod nano 5g presents itself as an ordinary USB storage device, which you can mount without problems. However simply copying files on it won't make the iPod show those files in the play lists, because there is some meta data stored on the device that must be updated too.

There are several user-space programs that allow you to import and export music from and to the iPod, and update those meta data files as necessary. The first one I tried, gtkpod 2.1.2, worked fine.

Other user-space programs reputed to work with the iPod are rhythmbox and amarok (which both not only organize but also play music).

Although I don't think anything really depends on some particular versions here (except that you need a new enough version of gtkpod), here is what I used:

  • Architecture: amd64
  • Linux: 3.2.0-4-amd64 #1 SMP Debian 3.2.35-2
  • Userland: Debian GNU/Linux "Wheezy" (currently "testing")
  • gtkpod: 2.1.2-1

Dave's Free Press: Journal: CPANdeps

Dave's Free Press: Journal: Module pre-requisites analyser

Dave's Free Press: Journal: Perl isn't dieing

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 3

Perlgeek.de : The Fun of Running a Public Web Service, and Session Storage

One of my websites, Sudokugarden, recently surged in traffic, from about 30k visitors per month to more than 100k visitors per month. Here's the tale of what that meant for the server side.

As a bit of background, I built the website in 2007, when I knew a lot less about the web and programming. It runs on a host that I share with a few friends; I don't have root access on that machine, though when the admin is available, I can generally ask him to install stuff for me.

Most parts of the websites are built as static HTML files, with Server Side Includes. Parts of those SSIs are Perl CGI scripts. The most popular part though, which allows you to solve Sudoku in the browser and keeps hiscores, is written as a collection of Perl scripts, backed by a mysql database.

When at peak times the site had more than 10k visitors a day, lots of visitors would get a nasty mysql: Cannot connect: Too many open connections error. The admin wasn't available for bumping the connection limit, so I looked for other solutions.

My first action was to check the logs for spammers and crawlers that might hammered the page, and I found and banned some; but the bulk of the traffic looked completely legitimate, and the problem persisted.

Looking at the seven year old code, I realized that most pages didn't actually need a database connection, if only I could remove the session storage from the database. And, in fact, I could. I used CGI::Session, which has pluggable backend. Switching to a file-based session backend was just a matter of changing the connection string and adding a directory for session storage. Luckily the code was clean enough that this only affected a single subroutine. Everything was fine.

For a while.

Then, about a month later, the host ran out of free disk space. Since it is used for other stuff too (like email, and web hosting for other users) it took me a while to make the connection to the file-based session storage. What happened was 3 million session files on a ext3 file system with a block size of 4 kilobyte. A session is only about 400 byte, but since a file uses up a multiple of the block size, the session storage amounted to 12 gigabyte of used-up disk space, which was all that was left on that machine.

Deleting those sessions turned out to be a problem; I could only log in as my own user, which doesn't have write access to the session files (which are owned by www-data, the Apache user). The solution was to upload a CGI script that deleted the session, but of course that wasn't possible at first, because the disk was full. In the end I had to delete several gigabyte of data from my home directory before I could upload anything again. (Processes running as root were still writing to reserved-to-root portions of the file system, which is why I had to delete so much data before I was able to write again).

Even when I was able to upload the deletion script, it took quite some time to actually delete the session files; mostly because the directory was too large, and deleting files on ext3 is slow. When the files were gone, the empty session directory still used up 200MB of disk space, because the directory index doesn't shrink on file deletion.

Clearly a better solution to session storage was needed. But first I investigated where all those sessions came from, and banned a few spamming IPs. I also changed the code to only create sessions when somebody logs in, not give every visitor a session from the start.

My next attempt was to write the sessions to an SQLite database. It uses about 400 bytes per session (plus a fixed overhead for the db file itself), so it uses only a tenth of storage space that the file-based storage used. The SQLite database has no connection limit, though the old-ish version that was installed on the server doesn't seem to have very fine-grained locking either; within a few days I could errors that the session database was locked.

So I added another layer of workaround: creating a separate session database per leading IP octet. So now there are up to 255 separate session database (plus a 256th for all IPv6 addresses; a decision that will have to be revised when IPv6 usage rises). After a few days of operation, it seems that this setup works well enough. But suspicious as I am, I'll continue monitoring both disk usage and errors from Apache.

So, what happens if this solution fails to work out? I can see basically two approaches: move the site to a server that's fully under my control, and use redis or memcached for session storage; or implement sessions with signed cookies that are stored purely on the client side.

Perlgeek.de : YAPC Europe 2013 Day 2

The second day of YAPC Europe was enjoyable and informative.

I learned about ZeroMQ, which is a bit like sockets on steriods. Interesting stuff. Sadly Design decisions on p2 didn't quite qualify as interesting.

Matt's PSGI archive is a project to rewrite Matt's infamous script archive in modern Perl. Very promising, and a bit entertaining too.

Lunch was very tasty, more so than the usual mass catering. Kudos to the organizers!

After lunch, jnthn talked about concurrency, parallelism and asynchrony in Perl 6. It was a great talk, backed by great work on the compiler and runtime. Jonathans talk are always to be recommended.

I think I didn't screw up my own talk too badly, at least the timing worked fine. I just forgot to show the last slide. No real harm done.

I also enjoyed mst's State of the Velociraptor, which was a summary of what went on in the Perl world in the last year. (Much better than the YAPC::EU 2010 talk with the same title).

The Lightning talks were as enjoyable as those from the previous day. So all fine!

Next up is the river cruise, I hope to blog about that later on.

Perlgeek.de : Stop The Rewrites!

What follows is a rant. If you're not in the mood to read a rant right now, please stop and come back in an hour or two.

The Internet is full of people who know better than you how to manage your open source project, even if they only know some bits and pieces about it. News at 11.

But there is one particular instance of that advice that I hear often applied to Rakudo Perl 6: Stop the rewrites.

To be honest, I can fully understand the sentiment behind that advice. People see that it has taken us several years to get where we are now, and in their opinion, that's too long. And now we shouldn't waste our time with rewrites, but get the darn thing running already!

But Software development simply doesn't work that way. Especially not if your target is moving, as is Perl 6. (Ok, Perl 6 isn't moving that much anymore, but there are still areas we don't understand very well, so our current understanding of Perl 6 is a moving target).

At some point or another, you realize that with your current design, you can only pile workaround on top of workaround, and hope that the whole thing never collapses.

Picture of
a Jenga tower
Image courtesy of sermoa

Those people who spread the good advice to never do any major rewrites again, they never address what you should do when you face such a situation. Build the tower of workarounds even higher, and pray to Cthulhu that you can build it robust enough to support a whole stack of third-party modules?

Curiously this piece of advice occasionally comes from people who otherwise know a thing or two about software development methodology.

I should also add that since the famous "nom" switchover, which admittedly caused lots of fallout, we had three major rewrites of subsystems (longest-token matching of alternative, bounded serialization and qbootstrap), All three of which caused no new test failures, and two of which caused no fallout from the module ecosystem at all. In return, we have much faster startup (factor 3 to 4 faster) and a much more correct regex engine.

Perlgeek.de : The REPL trick

A recent discussion on IRC prompted me to share a small but neat trick with you.

If there are things you want to do quite often in the Rakudo REPL (the interactive "Read-Evaluate-Print Loop"), it makes sense to create a shortcut for them. And creating shortcuts for often-used stuff is what programming languages excel at, so you do it right in Perl module:

use v6;
module REPLHelper;

sub p(Mu \x) is export {
    x.^mro.map: *.^name;
}

I have placed mine in $HOME/.perl6/repl.

And then you make sure it's loaded automatically:

$ alias p6repl="perl6 -I$HOME/.perl6/repl/ -MREPLHelper"
$ p6repl
> p Int
Int Cool Any Mu
>

Now you have a neat one-letter function which tells you the parents of an object or a type, in method resolution order. And a way to add more shortcuts when you need them.

Perlgeek.de : News in the Rakudo 2012.06 release

Rakudo development continues to progress nicely, and so there are a few changes in this month's release worth explaining.

Longest Token Matching, List Iteration

The largest chunk of development effort went into Longest-Token Matching for alternations in Regexes, about which Jonathan already blogged. Another significant piece was Patrick's refactor of list iteration. You probably won't notice much of that, except that for-loops are now a bit faster (maybe 10%), and laziness works more reliably in a couple of cases.

String to Number Conversion

String to number conversion is now stricter than before. Previously an expression like +"foo" would simply return 0. Now it fails, ie returns an unthrown exception. If you treat that unthrown exception like a normal value, it blows up with a helpful error message, saying that the conversion to a number has failed. If that's not what you want, you can still write +$str // 0.

require With Argument Lists

require now supports argument lists, and that needs a bit more explaining. In Perl 6 routines are by default only looked up in lexical scopes, and lexical scopes are immutable at run time. So, when loading a module at run time, how do you make functions available to the code that loads the module? Well, you determine at compile time which symbols you want to import, and then do the actual importing at run time:

use v6;
require Test <&plan &ok &is>;
#            ^^^^^^^^^^^^^^^ evaluated at compile time,
#                            declares symbols &plan, &ok and &is
#       ^^^                  loaded at run time

Module Load Debugging

Rakudo had some trouble when modules were precompiled, but its dependencies were not. This happens more often than it sounds, because Rakudo checks timestamps of the involved files, and loads the source version if it is newer than the compiled file. Since many file operations (including simple copying) change the time stamp, that could happen very easily.

To make debugging of such errors easier, you can set the RAKUDO_MODULE_DEBUG environment variable to 1 (or any positive number; currently there is only one debugging level, in the future higher numbers might lead to more output).

$ RAKUDO_MODULE_DEBUG=1 ./perl6 -Ilib t/spec/S11-modules/require.t
MODULE_DEBUG: loading blib/Perl6/BOOTSTRAP.pbc
MODULE_DEBUG: done loading blib/Perl6/BOOTSTRAP.pbc
MODULE_DEBUG: loading lib/Test.pir
MODULE_DEBUG: done loading lib/Test.pir
1..5
MODULE_DEBUG: loading t/spec/packages/Fancy/Utilities.pm
MODULE_DEBUG: done loading t/spec/packages/Fancy/Utilities.pm
ok 1 - can load Fancy::Utilities at run time
ok 2 - can call our-sub from required module
MODULE_DEBUG: loading t/spec/packages/A.pm
MODULE_DEBUG: loading t/spec/packages/B.pm
MODULE_DEBUG: loading t/spec/packages/B/Grammar.pm
MODULE_DEBUG: done loading t/spec/packages/B/Grammar.pm
MODULE_DEBUG: done loading t/spec/packages/B.pm
MODULE_DEBUG: done loading t/spec/packages/A.pm
ok 3 - can require with variable name
ok 4 - can call subroutines in a module by name
ok 5 - require with import list

Module Loading Traces in Compile-Time Errors

If module myA loads module myB, and myB dies during compilation, you now get a backtrace which indicates through which path the erroneous module was loaded:

$ ./perl6 -Ilib -e 'use myA'
===SORRY!===
Placeholder variable $^x may not be used here because the surrounding block
takes no signature
at lib/myB.pm:1
  from module myA (lib/myA.pm:3)
  from -e:1

Improved autovivification

Perl allows you to treat not-yet-existing array and hash elements as arrays or hashes, and automatically creates those elements for you. This is called autovivification.

my %h;
%h<x>.push: 1, 2, 3; # worked in the previous release too
push %h<y>, 4, 5, 6; # newly works in the 2012.06

Dave's Free Press: Journal: Travelling in time: the CP2000AN

Perlgeek.de : Localization for Exception Messages

Ok, my previous blog post wasn't quite as final as I thought.. My exceptions grant said that the design should make it easy to enable localization and internationalization hooks. I want to discuss some possible approaches and thereby demonstrate that the design is flexible enough as it is.

At this point I'd like to mention that much of the flexibility comes from either Perl 6 itself, or from the separation of stringifying and exception and generating the actual error message.

Mixins: the sledgehammer

One can always override a method in an object by mixing in a role which contains the method on question. When the user requests error messages in a different language, one can replace method Str or method message with one that generates the error message in a different language.

Where should that happen? The code throws exceptions is fairly scattered over the code base, but there is a central piece of code in Rakudo that turns Parrot-level exceptions into Perl 6 level exceptions. That would be an obvious place to muck with exceptions, but it would mean that exceptions that are created but not thrown don't get the localization. I suspect that's a fairly small problem in the real world, but it still carries code smell. As does the whole idea of overriding methods.

Another sledgehammer: alternative setting

Perl 6 provides built-in types and routines in an outer lexical scope known as a "setting". The default setting is called CORE. Due to the lexical nature of almost all lookups in Perl 6, one can "override" almost anything by providing a symbol of the same name in a lexical scope.

One way to use that for localization is to add another setting between the user's code and CORE. For example a file DE.setting:

my class X::Signature::Placeholder does X::Comp {
    method message() {
        'Platzhaltervariablen können keine bestehenden Signaturen überschreiben';
    }
}

After compiling, we can load the setting:

$ ./perl6 --target=pir --output=DE.setting.pir DE.setting
$ ./install/bin/parrot -o DE.setting.pbc DE.setting.pir
$ ./perl6 --setting=DE -e 'sub f() { $^x }'
===SORRY!===
Platzhaltervariablen können keine bestehenden Signaturen überschreiben
at -e:1

That works beautifully for exceptions that the compiler throws, because they look up exception types in the scope where the error occurs. Exceptions from within the setting are a different beast, they'd need special lookup rules (though the setting throws far fewer exceptions than the compiler, so that's probably manageable).

But while this looks quite simple, it comes with a problem: if a module is precompiled without the custom setting, and it contains a reference to an exception type, and then the l10n setting redefines it, other programs will contain references to a different class with the same name. Which means that our precompiled module might only catch the English version of X::Signature::Placeholder, and lets our localized exception pass through. Oops.

Tailored solutions

A better approach is probably to simply hack up the string conversion in type Exception to consider a translator routine if present, and pass the invocant to that routine. The translator routine can look up the error message keyed by the type of the exception, and has access to all data carried in the exception. In untested Perl 6 code, this might look like this:

# required change in CORE
my class Exception {
    multi method Str(Exception:D:) {
        return self.message unless defined $*LANG;
        if %*TRANSLATIONS{$*LANG}{self.^name} -> $translator {
            return $translator(self);
        }
        return self.message; # fallback
    }
}

# that's what a translator could write:

%*TRANSLATIONS<de><X::TypeCheck::Assignment> = {
        "Typenfehler bei Zuweisung zu '$_.symbol()': "
        ~ "'{$_.expected.^name}' erwartet, aber '{$_.got.^name} bekommen"
    }
}

And setting the dynamic language $*LANG to 'de' would give a German error message for type check failures in assignment.

Another approach is to augment existing error classes and add methods that generate the error message in different languages, for example method message-fr for French, and check their existence in Exception.Str if a different language is requested.

Conclusion

In conclusion there are many bad and enough good approaches; we will decide which one to take when the need arises (ie when people actually start to translate error messages).

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 1

Ocean of Awareness: Significant newlines? Or semicolons?

Should statements have explicit terminators, like the semicolon of Perl and the C language? Or should they avoid the clutter, and separate statements by giving whitespace syntactic significance and a real effect on the semantics, as is done in Python and Javascript?

Actually we don't have to go either way. As an example, let's look at some BNF-ish DSL. It defines a small calculator. At first glance, it looks as if this language has taken the significant-whitespace route -- there certainly are no explicit statement terminators.

:default ::= action => ::first
:start ::= Expression
Expression ::= Term
Term ::=
      Factor
    | Term '+' Term action => do_add
Factor ::=
      Number
    | Factor '*' Factor action => do_multiply
Number ~ digits
digits ~ [\d]+
:discard ~ whitespace
whitespace ~ [\s]+

The rule is that there isn't one

If we don't happen to like the layout of the above DSL, and rearrange it in various ways, we'll find that everything we try works. If we become curious about what exactly what the rules for newlines are, and look at the documentation, we won't find any. That's because there aren't any.

We can see this by thoroughly messing up the line structure:

:default ::= action => ::first :start ::= Expression Expression ::= Term
Term ::= Factor | Term '+' Term action => do_add Factor ::= Number |
Factor '*' Factor action => do_multiply Number ~ digits digits ~
[\d]+ :discard ~ whitespace whitespace ~ [\s]+

The script will continue to run just fine.

How does it work?

How does it work? Actually, pose the question this way: Can a human reader tell where the statements end? If the reader is not used to reading BNF, he might have trouble with this particular example but, for a language that he knows, the answer is simple: Yes, of course he can. So really the question is, why do we expect the parser to be so stupid that it cannot?

The only trick is that this is done without trickery. Marpa's DSL is written in itself, and Marpa's self-grammar describes exactly what a statement is and what it is not. The Marpa parser is powerful enough to simply take this self-describing DSL and act on it, finding where statements begin and end, much as a human reader is able to.

To learn more

This example was produced with the Marpa parser. Marpa::R2 is available on CPAN. The code for this example is based on that in the synopsis for its top-level document, but it is isolated conveniently in a Github gist.

A list of my Marpa tutorials can be found here. There are new tutorials by Peter Stuifzand and amon. The Ocean of Awareness blog focuses on Marpa, and it has an annotated guide. Marpa has a web page that I maintain and Ron Savage maintains another. For questions, support and discussion, there is the "marpa parser" Google Group. Comments on this post can be made there.

Dave's Free Press: Journal: Thanks, Yahoo!

Ocean of Awareness: Marpa has a new web page

Marpa has a new official public website, which Ron Savage has generously agreed to manage. For those who have not heard of it, Marpa is a parsing algorithm. It is new, but very much based on earlier work by Jay Earley, Joop Leo, John Aycock and R. Nigel Horspool. Marpa is intended to replace, and to go well beyond, recursive descent and the yacc family of parsers.

  • Marpa is fast. It parses in linear time:
    • all the grammar classes that recursive descent parses;
    • the grammar class that the yacc family parses;
    • in fact, all unambiguous grammars, as long as they are free of unmarked middle recursions; and
    • all ambiguous grammars that are unions of a finite set of any of the above grammars.
  • Marpa is powerful. Marpa will parse anything that can be written in BNF. This includes any mixture of left, right and middle recursions.
  • Marpa is convenient. Unlike recursive descent, you do not have to write a parser -- Marpa generates one from BNF. Unlike PEG or yacc, parser generation is unrestricted and exact. Marpa converts any grammar which can be written as BNF into a parser which recognizes everything in the language described by that BNF, and which rejects everything that is not in that language. The programmer is not forced to make arbitrary choices while parsing. If a rule has several alternatives, all of the alternatives are considered for as long as they might yield a valid parse.
  • Marpa is flexible. Like recursive descent, Marpa allows you to stop and do your own custom processing. Unlike recursive descent, Marpa makes available to you detailed information about the parse so far -- which rules and symbols have been recognized, with their locations, and which rules and symbols are expected next.

Comments

Comments on this post can be made in Marpa's Google group.

Dave's Free Press: Journal: POD includes

Perlgeek.de : First day at YAPC::Europe 2013 in Kiev

Today was the first "real" day of YAPC Europe 2013 in Kiev. In the same sense that it was the first real day, we had quite a nice "unreal" conference day yesterday, with a day-long Perl 6 hackathon, and in the evening a pre-conference meeting a Sovjet-style restaurant with tasty food and beverages.

The talks started with a few words of welcome, and then the announcement that the YAPC Europe next year will be in Sofia, Bulgaria, with the small side note that there were actually three cities competing for that honour. Congratulations to Sofia!

Larry's traditional keynote was quite emotional, and he had to fight tears a few times. Having had cancer and related surgeries in the past year, he still does his perceived duty to the Perl community, which I greatly appreciate.

Afterwards Dave Cross talked about 25 years of Perl in 25 minutes, which was a nice walk through some significant developments in the Perl world, though a bit hasty. Maybe picking fewer events and spending a bit more time on the selected few would give a smoother experience.

Another excellent talk that ran out of time was on Redis. Having experimented a wee bit with Redis in the past month, this was a real eye-opener on the wealth of features we might have used for a project at work, but in the end we didn't. Maybe we will eventually revise that decision.

Ribasushi talked about how hard benchmarking really is, and while I was (in principle) aware of that fact that it's hard to get right, there were still several significant factors that I overlooked (like the CPU's tendency to scale frequency in response to thermal and power-management considerations). I also learned that I should use Dumbbench instead of the Benchmark.pm core module. Sadly it didn't install for me (Capture::Tiny tests failing on Mac OS X).

The Perl 6 is dead, long live Perl 5 talk was much less inflammatory than the title would suggest (maybe due to Larry touching on the subject briefly during the keynote). It was mostly about how Perl 5 is used in the presenter's company, which was mildly interesting.

After tasty free lunch I attended jnthn's talk on Rakudo on the JVM, which was (as is typical for jnthn's talk) both entertaining and taught me something, even though I had followed the project quite a bit.

Thomas Klausner's Bread::Board by example made me want to refactor the OTRS internals very badly, because it is full of the anti-patterns that Bread::Board can solve in a much better way. I think that the OTRS code base is big enough to warrant the usage of Bread::Board.

I enjoyed Denis' talk on Method::Signatures, and was delighted to see that most syntax is directly copied from Perl 6 signature syntax. Talk about Perl 6 sucking creativity out of Perl 5 development.

The conference ended with a session of lighning talks, something which I always enjoy. Many lightning talks had a slightly funny tone or undertone, while still talking about interesting stuff.

Finally there was the "kick-off party", beverages and snacks sponsored by booking.com. There (and really the whole day, and yesterday too) I not only had conversations with my "old" Perl 6 friends, but also talked with many interesting people I never met before, or only met online before.

So all in all it was a nice experience, both from the social side, and from quality and contents of the talks. Venue and food are good, and the wifi too, except when it stops working for a few minutes.

I'm looking forward to two more days of conference!

(Updated: Fixed Thomas' last name)

Dave's Free Press: Journal: cgit syntax highlighting

Dave's Free Press: Journal: CPAN Testers' CPAN author FAQ

Perlgeek.de : Correctness in Computer Programs and Mathematical Proofs

While reading On Proof and Progress in Mathematics by Fields Medal winner Bill Thurston (recently deceased I was sorry to hear), I came across this gem:

The standard of correctness and completeness necessary to get a computer program to work at all is a couple of orders of magnitude higher than the mathematical community’s standard of valid proofs. Nonetheless, large computer programs, even when they have been very carefully written and very carefully tested, always seem to have bugs.

I noticed that mathematicians are often sloppy about the scope of their symbols. Sometimes they use the same symbol for two different meanings, and you have to guess from context which on is meant.

This kind of sloppiness generally doesn't have an impact on the validity of the ideas that are communicated, as long as it's still understandable to the reader.

I guess on reason is that most mathematical publications still stick to one-letter symbol names, and there aren't that many letters in the alphabets that are generally accepted for usage (Latin, Greek, a few letters from Hebrew). And in the programming world we snort derisively at FORTRAN 77 that limited variable names to a length of 6 characters.

Ocean of Awareness: Parsing: a timeline

1960: The ALGOL 60 spec comes out. It specifies, for the first time, a block structured language. The ALGOL committee is well aware that nobody knows how to parse such a language. But they believe that, if they specify a block-structured language, a parser for it will be invented. Risky as this approach is, it pays off ...

1961: Ned Irons publishes his ALGOL parser. In fact, the Irons parser is the first parser of any kind to be described in print. Ned's algorithm is a left parser -- a form of recursive descent. Unlike modern recursive descent, the Irons algorithm is general and syntax-driven. "General" means it can parse anything written in BNF. "Syntax-driven" (aka declarative) means that parser is actually created from the BNF -- the parser does not need to be hand-written.

1961: Almost simultaneously, hand-coded approaches to left parsing appear. These we would now recognize as recursive descent. Over the following years, hand-coded approaches will become more popular for left parsers than syntax-driven algorithms. Three factors are at work:

  • In the 1960's, memory and CPU are both extremely limited. Hand-coding pays off, even when the gains are small.
  • Pure left parsing is a very weak parsing technique. Hand-coding is often necessary to overcome its limits. This is as true today as it is in 1961.
  • Left parsing works well in combination with hand-coding -- they are a very good fit.

1965: Don Knuth invents LR parsing. Knuth is primarily interested in the mathematics. He describes a parsing algorithm, but it is not thought practical.

1968: Jay Earley invents the algorithm named after him. Like the Irons algorithm, Earley's algorithm is syntax-driven and fully general. Unlike the Irons algorithm, it does not backtrack. Earley's core idea is to track everything about the parse in tables. Earley's algorithm is enticing, but it has three major issues:

  • First, there is a bug in the handling of zero-length rules.
  • Second, it is quadratic for right recursions.
  • Third, the bookkeeping required to set up the tables is, by the standards of 1968 hardware, daunting.

1969: Frank DeRemer described a new variant of Knuth's LR parsing. DeRemer's LALR algorithm requires only a stack and a state table of quite manageable size.

1972: Aho and Ullmann describe a straightforward fix to the zero-length rule bug in Earley's original algorithm. Unfortunately, this fix involves adding even more bookkeeping to Earley's.

1975: Bell Labs converts its C compiler from hand-written recursive descent to DeRemer's LALR algorithm.

1977: The first "Dragon book" comes out. This soon-to-be classic textbook is nicknamed after the drawing on the front cover, in which a knight takes on a dragon. Emblazoned on the knight's lance are the letters "LALR". From here on out, to speak lightly of LALR will be to besmirch the escutcheon of parsing theory.

1987: Larry Wall introduces Perl 1. Perl embraces complexity like no previous language. Larry uses LALR very aggressively -- to my knowledge more aggressively than anyone before or since.

1991: Joop Leo discovers a way of speeding up right recursions in Earley's algorithm. Leo's algorithm is linear for just about every unambiguous grammar of practical interest, and many ambiguous ones as well. In 1991 hardware is six orders of magnitude faster than 1968 hardware, so that the issue of bookkeeping overhead had receded in importance. This is a major discovery. When it comes to the speed, the game has changed in favor of Earley algorithm. But Earley parsing is almost forgotten. It will be 20 years before anyone writes a practical implementation of Leo's algorithm.

1990's: Earley's is forgotten. So everyone in LALR-land is content, right? Wrong. Far from it, in fact. Users of LALR are making unpleasant discoveries. While LALR automatically generates their parsers, debugging them is so hard they could just as easily write the parser by hand. Once debugged, their LALR parsers are fast for correct inputs. But almost all they tell the users about incorrect inputs is that they are incorrect. In Larry's words, LALR is "fast but stupid".

2000: Larry Wall decides on a radical reimplementation of Perl -- Perl 6. Larry does not even consider using LALR again.

2002: Aycock&Horspool publish their attempt at a fast, practical Earley's parser. Missing from it is Joop Leo's improvement -- they seem not to be aware of it. Their own speedup is limited in what it achieves and the complications it introduces can be counter-productive at evaluation time. But buried in their paper is a solution to the zero-length rule bug. And this time the solution requires no additional bookkeeping.

2006: GNU announces that the GCC compiler's parser has been rewritten. For three decades, the industry's flagship C compilers have used LALR as their parser -- proof of the claim that LALR and serious parsing are equivalent. Now, GNU replaces LALR with the technology that it replaced a quarter century earlier: recursive descent.

2000 to today: With the retreat from LALR comes a collapse in the prestige of parsing theory. After a half century, we seem to be back where we started. If you took Ned Iron's original 1961 algorithm, changed the names and dates, and translated from code from the mix of assembler and ALGOL into Haskell, you would easily republish it today, and bill it as as revolutionary and new.

Marpa

Over the years, I had come back to Earley's algorithm again and again. Around 2010, I realized that the original, long-abandoned vision -- an efficient, practical, general and syntax-driven parser -- was now, in fact, quite possible. The necessary pieces had fallen into place.

Aycock&Hospool has solved the zero-length rule bug. Joop Leo had found the speedup for right recursion. And the issue of bookkeeping overhead had pretty much evaporated on its own. Machine operations are now a billion times faster than in 1968, and probably no longer relevant in any case -- caches misses are now the bottleneck.

But while the original issues with Earley's disappeared, a new issue emerged. With a parsing algorithm as powerful as Earley's behind it, a syntax-driven approach can do much more than it can with a left parser. But with the experience with LALR in their collective consciousness, few modern programmers are prepared to trust a purely declarative parser. As Lincoln said, "Once a cat's been burned, he won't even sit on a cold stove."

To be accepted, Marpa needed to allow procedure parsing, not just declarative parsing. So Marpa allows the user to specify events -- occurrences of symbols and rules -- at which declarative parsing pauses. While paused, the application can call procedural logic and single-step forward token by token. The procedural logic can hand control back over to syntax-driven parsing at any point it likes. The Earley tables can provide the procedural logic with full knowledge of the state of the parse so far: all rules recognized in all possible parses so far, and all symbols expected. Earley's algorithm is now a even better companion for hand-written procedural logic than recursive descent.

For more

For more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site. Comments on this post can be made in Marpa's Google group.

Dave's Free Press: Journal: YAPC::Europe 2006 report: day 3

Header image by Tambako the Jaguar. Some rights reserved.