NEILB: The River of CPAN

This blog post describes a model that we found useful for talking about CPAN dependencies and reverse dependencies at the QA Hackathon. At the head of the river is Perl itself with the core modules. The river flows into the sea, which contains all distributions that aren't used by any other distribution. Other distributions sit somewhere along the river, their position determined by their reverse dependencies. This post introduces the core concepts, but nothing more.

PAL-Blog: Blog-Battle #12: Phobie

Today's Blog-Battle is about the tendency of substances to mix (or not) with certain other substances. At first glance a lovely topic, one you could really run with. But only at first glance.

Perl Foundation News: Perl 6 Hague Grant Application

We have received the following Hague Grant application from Bart Wiegmans. Before we vote on this proposal we would like to have a ten-day period of community consultation. Please leave feedback in the comments or, if you prefer, send your comments by email to karen at perlfoundation.org.

Name: Bart Wiegmans

Project Title: Advancing the MoarVM JIT

Synopsis:

Implement an advanced code generation algorithm for the MoarVM JIT compiler, leading to more efficient machine code for JIT compiled frames.

Benefits to Perl 6 Development:

This project will enable the MoarVM JIT compiler to generate smaller and more efficient code, producing greater performance and less memory traffic during execution. This will help to make Perl 6 more competitive with other languages. (Speed is generally regarded as a feature).

As a secondary benefit, this will decouple the JIT intermediate representation from the MoarVM bytecode and from x64 machine code, making it easier for developers to extend the JIT or to port it to architectures other than x64, such as ARM.

As an example of the potential speedup, I've created the following demonstration of a 5x speedup on tight numeric code. Although that example is highly artificial, it demonstrates the limits of the current JIT rather well. Note also that the original hot loop uses 22 instructions whereas the new hot loop uses only 7.

Deliverables:

These deliverables are not ordered chronologically.

  1. An implementation of the code generation algorithm described below, including instruction selection tables for the x64 architecture
  2. A runtime representation of machine operations ('expressions') that will form the input to the code generator and is suitable for targeting different architectures
  3. A patched version of DynASM capable of addressing the extended registers of x64
  4. Conversion of large parts of the current JIT to the new algorithm
  5. An extension API for REPR ops to insert (inline) expressions into the JIT in place of some operations
  6. A set of automated tests, known to trigger JIT compilation, added to either the NQP or Rakudo Perl 6 test suite
  7. Reports and documentation explaining the JIT and its API

Project Details:

Since September 2014, MoarVM has had a JIT compiler for optimised frames. This JIT compiler has proven reliable since becoming the default option in October 2014, but it does not generate very efficient machine code. This is because it is very simple (simplistic, perhaps) and generates code by pasting together preformed blocks of machine code.

Because these machine code fragments are necessarily independent, they must store and load values from memory every time they execute, even if these values are never read afterwards or if they are only temporarily useful. This results in large code size and heavy memory traffic. Moreover, because each part is independent, even very simple optimizations across fragment boundaries are impossible or very hard. Thus, the potential performance increase gained by JIT compilation is limited by the current design.

In this project I propose to implement a more advanced code generation algorithm based on the paper by Aho et al, which I will not describe in detail here. What is relevant for this proposal is that I intend to embed it into the existing JIT compiler as an additional node type. This means that the current JIT compiler can be converted to use this new algorithm in small pieces. This also means that the new algorithm must be written to use the DynASM bytecode generation library. This in turn means that DynASM must be extended to support register addressing on x64 (it currently only does so on x86).

Although this is a large project, it can be divided into 'inchstones' relatively cleanly, each of which can be divided further:

  • Implement a 'proof of concept' code generator using DynASM with limited register addressing. This is intended to find any limits to DynASM (or my understanding) beyond the register addressing limitation already discussed.
  • Eliminate the register addressing limitation. In practice this comes down to conditionally adding register extension prefix bytes before the actual instructions. Last year I discussed briefly how this should be done with Mike Pall, the author of DynASM and luajit.
  • Implement a machine-level expression graph data type, suitable for expressing MoarVM-level concepts close to machine-level operations.
  • Actually implement the algorithm, including components like register selection, instruction selection, and value spills.
  • Convert current JIT compiler fragments to the new expression trees. Areas of specific interest include branching, C calls, VM routine calls, GC barriers, deoptimization and exception handlers. Each of these is a goal unto itself. The current JIT fragments can function as templates for the expression graph.
  • Implement an extension API for REPRs (object representations) to inline operations into JIT expressions.

Automated tests will be developed during development of features and collected over time. Given enough time and less-than-expected difficulties in implementation, I have many more ambitions, but I think this should provide enough work for now.

Project Schedule:

I estimate that this project will take 10 weeks of full-time work, excluding time spent studying in preparation. Around two to three weeks will be spent on creating a proof-of-concept code generator and extending DynASM for x64 register addressing. If this is successful, I think around two weeks will be spent on implementing this for MoarVM, of which one week on the algorithm and another on the instruction and register tables. After this I'll be able to demonstrate a simple JIT-compiled frame using the new algorithm.

The next three weeks will be spent converting more complex parts until a large part of the current JIT has been converted. Finally, in the last two weeks I will implement the extension API as well as one or more such extensions, for example inlined array access.

I can begin working in the week starting Sunday June 14 2015, meaning I intend to finish Saturday August 22. The halfway milestone should then be reached Saturday July 18.

Report Schedule:

I will report on my work using my blog, as I did last year during GSoC. Blog entries will appear at least every two weeks, and preferably more often. This blog is syndicated on the pl6anet.org news aggregation site. Furthermore, I will report often to the #perl6 and #moarvm channels on freenode to discuss my progress. I'm also willing to provide mailing-list announcements (e.g. on perl6-announce) when milestones are reached.

Public Repository:

The final code will be hosted in the MoarVM public git repository at github.com/MoarVM/MoarVM as well as in MoarVM's fork of DynASM. Changes to DynASM will also be offered to the luajit project from which DynASM has been extracted. Initially work will be done in a separate branch so as to not disturb 'regular' users. Any proof-of-concept code will be hosted at my personal github account.

Grant Deliverables ownership/copyright and License Information:

MoarVM is licensed under the Artistic 2.0 license. DynASM is licensed under the MIT license. Copyright on MoarVM belongs to Jonathan Worthington and others (according to the current LICENSE file). Copyright on DynASM belongs to Mike Pall. I naturally have no authority to change the copyright on these projects. The Perl Foundation may of course hold copyright on the patches applied for this project (although I don't know how that works out legally).

Bio:

I was the author of the current MoarVM JIT compiler during the 2014 edition of GSoC. Thus, I know it very well. I consider myself to have been a member of the Perl 6 community since 2012, when I wrote the mod_parrot module for apache httpd during the 2012 edition of GSoC. I wrote my first interpreter around 10 years ago.

In real life I've studied physics and biology, and am currently studying environmental science. I generally have lots of time available in the summer. Lastly, I obsess about speed, and the fact that MoarVM and Rakudo Perl 6 are not as fast as they could be bothers me.

Amount Requested:

I request $10,000 for 2.5 months of work. I'd suggest payment of $5,000 after the halfway milestone and $5,000 after the project is finished.

Suggestions for Grant Manager:

Jonathan Worthington (jnthn), who also mentored me during the 2014 edition of Google Summer of Code. He is the lead developer of MoarVM and a major contributor to the Rakudo Perl 6 compiler, on which he has given many talks at Perl conferences and elsewhere.

Perl Hacks: Subroutines and Ampersands

I’ve had this discussion several times recently, so I thought it was worth writing a blog post so that I have somewhere to point people the next time it comes up.

Using ampersands on subroutine calls (&my_sub or &my_sub(...)) is never necessary and can have potentially surprising side-effects. It should, therefore, never be used and should particularly be avoided in examples aimed at beginners.

Using an ampersand when calling a subroutine has three effects.

  1. It disambiguates the code so that the Perl compiler knows for sure that it has come across a subroutine call.
  2. It turns off prototype checking.
  3. If you use the &my_sub form (i.e. without parentheses) then the current value of @_ is passed on to the called subroutine.

Let’s look at these three effects in a little more detail.

Disambiguating the code is obviously a good idea. But adding the ampersand is not the only way to do it. Adding a pair of parentheses to the end of the call (my_sub()) has exactly the same effect. And, as a bonus, it looks the same as subroutine calls do in pretty much every other programming language ever invented. I can’t think of a single reason why anyone would pick &my_sub over my_sub().

I hope we’re agreed that prototypes are unnecessary in most Perl code (perhaps that needs to be another blog post at some point). Of course there are a few good reasons to use them, but most of us won’t be using them most of the time. If you’re using them, then turning off prototype checking seems like a bad idea. And if you’re not using them, then it doesn’t matter whether they’re checked or not. There’s no good argument here for using ampersands.

Then we come to the invisible passing of @_ to the called subroutine. I have no idea why anyone ever thought this was a good idea. The perlsub documentation calls it “an efficiency mechanism” but admits that it is one “that new users may wish to avoid”. If you want @_ to be available to the called subroutine then just pass it in explicitly. Your maintenance programmer (and remember, that could be you in six months’ time) will be grateful and won’t waste hours trying to work out what is going on.
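
A minimal sketch of that difference (the subroutine names here are made up for illustration):

use strict;
use warnings;
use feature 'say';

sub log_args {
    say "Got: @_";
}

sub with_ampersand {
    # &log_args without parentheses passes this sub's @_ along implicitly
    &log_args;
}

sub without_ampersand {
    # a plain call passes nothing unless you do it explicitly: log_args(@_)
    log_args();
}

with_ampersand( 'Buster', 'Mimi' );     # prints "Got: Buster Mimi"
without_ampersand( 'Buster', 'Mimi' );  # prints "Got: "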

So, no, there is no good reason to use ampersands when calling subroutines. Please don’t use them.

There is, of course, one case where ampersands are still useful when dealing with subroutines – when you are taking a reference to an existing, named subroutine. But that’s the only case that I can think of.
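
For example (again just a sketch, with a made-up subroutine name):

use strict;
use warnings;
use feature 'say';

sub my_sub { say "called with: @_" }

my $ref = \&my_sub;          # the one legitimate use of the ampersand
$ref->( 'Buster', 'Mimi' );  # call it later through the reference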

What do you think? Have I missed something?

It’s unfortunate that a lot of the older documentation on CPAN (and, indeed, some popular beginners’ books) still perpetuates this outdated style. It would be great if we could remove it from all example code.

The post Subroutines and Ampersands appeared first on Perl Hacks.

NEILB: Notifying authors about their reverse dependencies

This blog post outlines an idea for a service that informs people when their CPAN distributions gain reverse dependencies. Many authors are probably not even aware that their distributions have reverse dependencies, or of what the implications of that can be. Sending them an email gives us a chance to congratulate and engage authors, but also to educate and encourage them in some new practices.

NEILB: Recording DarkPAN dependencies on CPAN

This blog post outlines an idea for a service where people can register that they're using (i.e. dependent / reliant on) a CPAN distribution. This would provide additional information about which distributions underpin the Perl world, and if the registrants were contactable, it would help authors minimise breakages when making changes.

Perl Foundation News: Grants Update - April 2015

For the March round, we got no applications. The next round will be in May.

Grant updates:

Perl News: Perl 5.22 a preview

brian d foy has posted a summary of the changes for the upcoming Perl 5.22 release.

These include:

  • A safer ARGV
  • CGI.pm and Module::Build disappear

And much more; it’s worth a quick read.

Sawyer X: Dancer Master Class at YAPC::NA 2015

If you're interested in web development and/or the Dancer web framework, we're running a very special Dancer Master Class at this year's YAPC::NA.

The Dancer Master Class (Programming the web with Dancer - watch for the transparent YAPC website header blocking the title in the link) will cover a variety of topics, starting from the ground up.

You will learn the concepts of web programming, how Dancer works internally (using the latest Dancer2 version!), our architectural recommendations for laying out the foundations of a proper web application, how to use Dancer in different deployment environments, and have a blast writing a website and a web API to practice and build upon all you learned, all in a single day.

The course will be conducted by two Dancer core developers: Mickey Nasriachi and myself, Sawyer X.

The course also comes at an incredibly low price ($75, which is cost price!), since Booking.com is sponsoring it. If you don't mind me saying so, this is a special opportunity.

If you work at a company that uses Dancer, recommend this to your boss so they register you.

If you've thought of writing a website using Dancer, or wish to learn more about your favorite web framework (assuming it's Dancer), go ahead and register.

Feel free to recommend it to colleagues and friends who you think will benefit from it.

Did I mention there are limited seats? :)

Sawyer.

brian d foy: I broke plenv and cpanm, and the systems they run on

I didn't break them myself, but the Perl Power Tools project that I revitalized did. Well, that's not even strictly true. The Perl Power Tools, which install thin implementations of common Unix tools, highlighted a problem with the idea behind tools such as plenv.

For a while perlbrew was a really hot idea. It downloads, compiles, and installs completely separate perls for you. When you want to make a particular perl the default, you run a command and symlinks are shuffled around. I never particularly liked the tool because I use several versions of Perl at the same time, so I don't rely on what I might find in the PATH environment variable or what that symlink might point to. Changing symlinks affects the entire system and every other session. This is also the reason that #!/bin/env perl doesn't work for me: it always finds the first perl, which is not the one I usually want.

The plenv idea is a little different. You set environment variables, and shim versions of the programs you actually want figure out where to look for the real programs of the same name. This only affects your session, which is a big improvement. I first encountered this idea through Mac OS X, but in a slightly different form. That system perl is actually two perls in one file. You play with VERSIONER_PERL_VERSION and some other environment variables and the system selects the right thing. Since I don't use the system perl, I haven't played with this in a while, though. Try running strings on the perl binary on Mac OS X and you should see stuff from two different versions. Someone can probably explain it better than I can; I've never cared to investigate it.

And here's where Perl Power Tools highlighted a problem. plenv makes shims for the things in the *path-to-perl*/bin directory. That's where you'll find cpan, cpanm, and your other favorite perl tools. plenv, a bash script, used common unix tools, such as head, to do its work. But Perl Power Tools has its own head, which plenv sees and creates a shim for. That shim version calls plenv to see where head is, but it needs head to do that. Fork bomb. Drop the mic.

As I see it, this would break even if the Perl Power Tools version of head worked correctly, because the fork bomb happens while plenv tries to find head. See GitHub issue #443 for cpanm or GitHub issue #78 for plenv.

But then, that's the problem with assuming you know where something is (or that it is what it says on the tin). If I can install my own perl and put anything I like in there, or trick you into installing anything I like in there (such as by installing a module), I can change what happens on your system. It's a security issue I discuss in Mastering Perl. If you allow system to look through PATH, you might not get what you expected. That's why taint checking cares so much about insecure paths and only accepts very particular paths in that environment variable. But I also show in Mastering Perl how to defeat taint checking by doing the same thing with a fake perl.
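
A common defensive sketch (not taken from Mastering Perl itself) is to pin PATH to a known value before running external commands; the directories chosen here are only an example:

use strict;
use warnings;

# Pin PATH to a known-safe value instead of trusting whatever was inherited;
# adjust the directories for your own system.
local $ENV{PATH} = '/usr/bin:/bin';

# The other environment variables taint mode worries about.
delete @ENV{qw(IFS CDPATH ENV BASH_ENV)};

# Even the list form of system still searches PATH for the command,
# but now only the pinned directories are consulted.
system 'ls', '-l';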

Still, Perl Power Tools probably shouldn't drop stuff into something you already have in PATH (GitHub issue #19). They are really a tool of last resort that you should apply with a bit more consideration. They can live in their own directory, and you can add that directory to PATH yourself, most likely at the end. That's something I can fix.

I can't fix the idea of plenv (or perlbrew). If it works for you, that has nothing to do with me. It was too much magic for me, even though in idle moments I keep thinking about how to write my own version (say, called like myperl -v5.20 arg arg arg). The fix in plenv creates the unix utilities it needs in its own shell code.
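
Purely as a sketch of that idle idea (the layout under ~/perls and the default version are assumptions, not a real tool):

#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical "myperl" wrapper: pick a perl by version and exec it.
# The installation layout ($HOME/perls/perl-VERSION/bin/perl) is an assumption.
my $version = '5.20.2';    # arbitrary default
if ( @ARGV and $ARGV[0] =~ /\A-v(5\.\d+(?:\.\d+)?)\z/ ) {
    $version = $1;
    shift @ARGV;
}

my $perl = "$ENV{HOME}/perls/perl-$version/bin/perl";
die "no perl found at $perl\n" unless -x $perl;

exec $perl, @ARGV or die "exec $perl failed: $!\n";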

But, in all of this, there are some really cool things going on in Perl. The tools are getting more interesting, new ideas are getting out there, and when things happen, we all work it out. :)

perlancar's blog: Having your own queryable CPAN mirror using lcpan

minicpan

Many of you already know that you can easily download a mini version of CPAN (containing only the latest versions of modules: the ones currently indexed by PAUSE in 02packages.details.txt.gz, without the older versions residing in each author’s directory) for offline use, using minicpan (installable via “cpanm -n CPAN::Mini”). It’s currently about 4.5GB, so you can put it on your laptop. This offline CPAN mirror is useful when you develop during periods without a (reliable) internet connection, e.g. in remote vacation spots or during a flight. To install CPAN modules from this offline mirror, you can simply do:

% cpanm --mirror /path/to/your/cpan --mirror-only -n Foo::Bar

You can set up a shell alias for that, to make it more convenient. To update all installed modules on your system using this mirror, you can do:

% cpan-outdated --mirror file:/path/to/your/cpan | cpanm --mirror /path/to/your/cpan --mirror-only -n

lcpan

What I’m introducing in this blog post is another tool called lcpan (short for “local cpan”), which uses minicpan to download the mirror, then extracts META.yml/META.json from each release file and indexes the information from those meta files into a local SQLite database. The result is that, in addition to being able to install modules locally, you can also query various things about modules/distributions/releases. To install lcpan, simply do:

% cpanm -n App::lcpan

After that, download and index your CPAN mirror using:

% lcpan update

By default the CPAN mirror will be created at ~/cpan. If you want a different location, you can set one by creating a ~/lcpan.conf containing:

cpan=/path/to/your/cpan

Do “lcpan update” regularly (e.g. once a day) if you want to keep it up to date.

Installing modules

lcpan comes with a thin cpanm wrapper called lcpanm that makes it more convenient for you to install modules from your local CPAN mirror. Instead of:

% cpanm --mirror ~/cpan --mirror-only -n Foo::Bar

now you can simply type:

% lcpanm -n Foo::Bar

as lcpanm will read the lcpan configuration to find out where the local CPAN mirror is.

Querying your local CPAN mirror

Now for the fun part. As mentioned above, lcpan also creates a SQLite database (by default it’s in ~/cpan/index.db) that you can query, either using the lcpan tool itself (which already provides quite a few subcommands for querying various things) or, if needed, directly with DBI (a small sketch of that follows the listing examples below). First, the basics. To list authors:

% lcpan authors ;# just the CPAN IDs
% lcpan authors --detail ;# along with name, email, etc
% lcpan authors BING ;# search
% lcpan authors --detail @yahoo ;# search

To list modules:

% lcpan modules ;# just the names
% lcpan modules --detail ;# along with version, dist, etc
% lcpan modules Foo ;# search
% lcpan modules --author PERLANCAR --dist App-lcpan ;# add some filters

To list distributions:

% lcpan dists ;# just the names
% lcpan dists --detail ;# along with version, dist, etc
% lcpan dists Foo ;# search
% lcpan dists --author PERLANCAR --latest ;# add some filters
% lcpan dists --author PERLANCAR --detail --nolatest ;# list old versions of distribution

To list releases:

% lcpan releases ;# just the names
% lcpan releases --detail ;# along with version, dist, etc
% lcpan releases Foo ;# search
% lcpan releases --author PERLANCAR --has-buildpl --has-metajson ;# add some filters
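
And if you'd rather go straight at the index with DBI, here is a minimal sketch; it assumes DBD::SQLite is installed and the default index location, and deliberately avoids assuming anything about the schema, which may change between App::lcpan versions:

use strict;
use warnings;
use DBI;

# Default index location; adjust it if your ~/lcpan.conf points elsewhere.
my $db  = "$ENV{HOME}/cpan/index.db";
my $dbh = DBI->connect( "dbi:SQLite:dbname=$db", "", "", { RaiseError => 1 } );

# List the tables and the number of rows in each; exploring the actual
# columns is left to you, since the schema may change between versions.
my $tables = $dbh->selectcol_arrayref(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
);

for my $table (@$tables) {
    my ($count) = $dbh->selectrow_array("SELECT COUNT(*) FROM $table");
    printf "%-25s %d rows\n", $table, $count;
}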

Dependencies information

One of the most important pieces of information I want to query (and the reason I created lcpan in the first place, and the aspect of lcpan that is already being used by other distributions) is dependencies. Instead of having to browse metacpan.org or call its API, if you have a fairly recent index you can just query your local CPAN mirror index for dependency information. To list the modules required by a module (to be exact, the modules required by the distribution that a module is in):

% lcpan deps Text::ANSITable
+------------------------------+-----------+----------+
| module                       | author    | version  |
+------------------------------+-----------+----------+
| Border::Style::Role          | PERLANCAR | 0        |
| Color::RGB::Util             | PERLANCAR | 0        |
| Color::Theme::Role::ANSI     | PERLANCAR | 0        |
| Data::Unixish::ANSI          | SHARYANTO | 0.02     |
| Data::Unixish::Apply         | PERLANCAR | 1.33     |
| DateTime                     | DROLSKY   | 0        |
| Function::Fallback::CoreOrPP | PERLANCAR | 0        |
| JSON                         | MAKAMAKA  | 0        |
| Log::Any                     | DAGOLDEN  | 0        |
| Module::List                 | ZEFRAM    | 0        |
| Moo                          | HAARG     | 0        |
| Package::MoreUtil            | PERLANCAR | 0        |
| Parse::VarName               | SHARYANTO | 0        |
| Term::App::Role::Attrs       | PERLANCAR | 0        |
| Text::ANSI::Util             | PERLANCAR | 0.08     |
| experimental                 | LEONT     | 0        |
| namespace::clean             | RIBASUSHI | 0        |
| perl                         |           | 5.010001 |
+------------------------------+-----------+----------+

To view recursive dependencies, add -R:

% lcpan deps -R Text::ANSITable
+--------------------------------+-----------+----------+
| module                         | author    | version  |
+--------------------------------+-----------+----------+
| Border::Style::Role            | PERLANCAR | 0        |
| Color::RGB::Util               | PERLANCAR | 0        |
| Color::Theme::Role::ANSI       | PERLANCAR | 0        |
|   Color::ANSI::Util            | PERLANCAR | 0        |
| Data::Unixish::ANSI            | SHARYANTO | 0.02     |
|   Data::Unixish::Util          | PERLANCAR | 1.43     |
| Data::Unixish::Apply           | PERLANCAR | 1.33     |
|   Number::Format               | WRW       | 0        |
|   Number::Format::Metric       | PERLANCAR | 0        |
|   Rinci                        | PERLANCAR | v1.1.67  |
|     DefHash                    | PERLANCAR | v1.0.6   |
|   String::Pad                  | PERLANCAR | 0        |
|   Syntax::Feature::EachOnArray | SHARYANTO | 0        |
|     Hash::FieldHash            | GFUJI     | 0        |
|     syntax                     | PHAYLON   | 0        |
|       Data::OptList            | RJBS      | 0.104    |
|         Params::Util           | ADAMK     | 0        |
|         Sub::Install           | RJBS      | 0.921    |
|   Text::sprintfn               | PERLANCAR | 0        |
|   Tie::Simple                  | HANENKAMP | 0        |
|   Unixish                      | SHARYANTO | v1.0.1   |
| DateTime                       | DROLSKY   | 0        |
|   DateTime::Locale             | DROLSKY   | 0.41     |
|   DateTime::TimeZone           | DROLSKY   | 1.74     |
|     Class::Singleton           | SHAY      | 1.03     |
|     List::AllUtils             | DROLSKY   | 0        |
|       List::MoreUtils          | REHSACK   | 0.28     |
|         Exporter::Tiny         | TOBYINK   | 0.038    |
|       List::Util               | PEVANS    | 1.31     |
|   Params::Validate             | DROLSKY   | 0.76     |
| Function::Fallback::CoreOrPP   | PERLANCAR | 0        |
|   Clone::PP                    | NEILB     | 0        |
| JSON                           | MAKAMAKA  | 0        |
| Log::Any                       | DAGOLDEN  | 0        |
| Module::List                   | ZEFRAM    | 0        |
| Moo                            | HAARG     | 0        |
| Package::MoreUtil              | PERLANCAR | 0        |
| Parse::VarName                 | SHARYANTO | 0        |
|   Exporter::Lite               | NEILB     | 0        |
| Term::App::Role::Attrs         | PERLANCAR | 0        |
|   Moo::Role                    | HAARG     | 0        |
|     Class::Method::Modifiers   | ETHER     | 1.1      |
|     Devel::GlobalDestruction   | HAARG     | 0.11     |
|     Role::Tiny                 | HAARG     | 2        |
|   Term::Detect::Software       | PERLANCAR | 0        |
|     File::Which                | PLICEASE  | 0        |
| Text::ANSI::Util               | PERLANCAR | 0.08     |
|   Text::WideChar::Util         | PERLANCAR | 0.10     |
|     Unicode::GCString          | NEZUMI    | 0        |
|       MIME::Charset            | NEZUMI    | v1.6.2   |
| experimental                   | LEONT     | 0        |
| namespace::clean               | RIBASUSHI | 0        |
|   B::Hooks::EndOfScope         | ETHER     | 0.12     |
|     Sub::Exporter::Progressive | FREW      | 0.001006 |
|   Package::Stash               | DOY       | 0.23     |
|     Dist::CheckConflicts       | DOY       | 0.02     |
|     Module::Implementation     | DROLSKY   | 0.06     |
|       Module::Runtime          | ZEFRAM    | 0.012    |
|       Try::Tiny                | DOY       | 0        |
| perl                           |           | 5.010001 |
+--------------------------------+-----------+----------+

There are several options provided by the deps subcommand, e.g. listing only dependencies of a certain relationship (e.g. recommends) or phase (e.g. configure instead of runtime), filtering by author, and so on. Reverse dependency information is also available, because that’s just the other side of the same coin:

% lcpan rdeps Text::ANSITable
+-----------+-------------------------------------------+---------+
| author    | dist                                      | version |
+-----------+-------------------------------------------+---------+
| SHARYANTO | Data-Format-Pretty-Console                | 0.33    |
| PERLANCAR | Perinci-CmdLine-Classic                   | 1.49    |
| PERLANCAR | Perinci-CmdLine-Classic                   | 1.50    |
| PERLANCAR | Pod-Weaver-Section-BorderStyles-ANSITable | 0.03    |
| PERLANCAR | Text-ANSITable-ColorTheme-Extra           | 0.14    |
+-----------+-------------------------------------------+---------+

(Hm, rather embarrassing, isn’t it? Nobody but me is using it.) The rdeps subcommand also has several options, which you can see using lcpan rdeps --help or by consulting the manpage.

Other stuff

Lots of other stuff is also provided; from the documentation:

% lcpan mod2dist Text::ANSITable::ColorTheme::Default ;# -> Text-ANSITable

% lcpan mod2rel  Text::ANSITable::ColorTheme::Default ;# -> Text-ANSITable-0.39.tar.gz
% lcpan mod2rel  Text::ANSITable --full-path          ;# -> /cpan/authors/id/P/PE/PERLANCAR/Text-ANSITable-0.39.tar.gz

% lcpan dist2rel Text-ANSITable             ;# -> Text-ANSITable-0.39.tar.gz
% lcpan dist2rel Text-ANSITable --full-path ;# -> /cpan/authors/id/P/PE/PERLANCAR/Text-ANSITable-0.39.tar.gz

% lcpan distmods Text-ANSITable ;# list modules in a distribution
Text::ANSITable
Text::ANSITable::BorderStyle::Default
Text::ANSITable::ColorTheme::Default
Text::ANSITable::StyleSet::AltRow

% lcpan authormods PERLANCAR   ;# list an author's modules
% lcpan authordists PERLANCAR  ;# list an author's dists
% lcpan authorrels PERLANCAR   ;# list an author's releases

# who are authors with the most number of releases?
% lcpan authors-by-rel-count

# who are authors with the most number of distributions?
% lcpan authors-by-dist-count

# who are authors with the most number of registered modules/packages?
% lcpan authors-by-mod-count

# show all other authors' distributions using one of your modules
% lcpan authorrdeps PERLANCAR --user-author-isnt PERLANCAR

# show your old releases (which you should probably delete from CPAN?)
% lcpan releases --author PERLANCAR --nolatest

# what are modules that are used the most by other distributions?
% lcpan mods-by-rdep-count

Other/prior work

CPAN::SQLite is a module which parses the three CPAN indexes 01mailrc.txt.gz, 02packages.details.txt.gz, and 03modlist.data.gz (which is now empty) into a SQLite database. However, it does not index any dependency information, which I need.

CPANDB (and its companion generator CPANDB::Generator) also indexes information into a SQLite database, but aside from CPAN it also downloads and indexes additional sources like PAUSE upload data and CPAN ratings. The downloads are quite huge (multi-gigabyte) and not incremental, making it less convenient to update daily.

Pinto has a different goal of creating and managing CPAN-like repositories, but it can certainly be used to mirror CPAN and show you dependency information. However, the Pinto documentation warns about Pinto “not indexing exactly like PAUSE does”, so there might be minor/subtle differences.

Closing remarks

I’ll be adding more queries and subcommands as I see fit. If you have ideas, please send them my way. Or, if you want to add some features yourself, that’s welcome too. The code is on github, and adding a new subcommand should be easy and obvious.


brian d foy: Huh. Multiple beginning-of-line anchors work

I've never had a reason to use multiple beginning-of-line anchors in a regex, so I wondered if it would work. I guess it does.

use v5.10;

my $string = <<'HERE';
This is line one
This is a cat
This is a dog
This is a lizard
This is a bird
That is a ostrich
HERE

my @matches = $string =~ m/
^This\ is\ a\ (\S+) \s+
^This\ is\ a\ (\S+)
/xmg;

say "Matches are @matches";

It works:

Matches are cat dog

Not that you'd ever want to do this. I was curious, I tried it, and now I know. That probably means I'm going to try to get it into production somehow.

The Effective Perler: Use v5.20 subroutine signatures

Subroutine signatures, in a rudimentary form, have shown up in Perl v5.20 as an experimental feature. After a few years of debate and a couple of competing implementations, we have something we can use. And, because it was such a contentious subject, it got the attention a new feature deserves. They don’t have all the features we want (notably type and value constraints), but Perl is in a good position to add those later.

Perl’s roots were in simplicity and getting started as quickly as possible. We wanted to define subroutines without much work. Instead of creating signatures in a C header file and worrying about inputs and outputs, Larry made subroutines take in lists and return lists. Done and done.

This simplicity means you have to do quite a bit of work yourself. You have to process the input list in @_, assign your own default values, and declare the variables that will store them. For the typical programmer, that didn't end up saving much.
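
A typical pre-signature subroutine looked something like this (a generic sketch, not taken from the article):

use strict;
use warnings;
use feature 'say';

sub animals {
    die "Too many arguments" if @_ > 3;   # enforce the argument count yourself
    my ( $cat, $dog, $lizard ) = @_;      # unpack the argument list yourself
    $lizard //= 'MechaGodzilla';          # supply your own default value
    say "The cat is $cat, the dog is $dog, the lizard is $lizard";
}

animals( 'Buster', 'Nikki' );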

To start, you need to enable the experimental feature (see Item 2. Enable new Perl features when you need them. and turn off the experimental warnings):

use v5.20;
use feature qw(signatures);
no warnings qw(experimental::signatures);

To define a signature, you use the spot after the subroutine name where so far Perl has only allowed a prototype. This is not a prototype though; it’s something different. Notably, the new subroutine signatures work with methods, whereas prototypes are compiler hints that method calls ignore:

use v5.10;

sub some_sub ($$) {
	say "I got $_[0] and $_[1]";
	}

some_sub( 'Buster', 'Mimi' );

main->some_sub();             # "works" just fine

some_sub( qw(Buster) );       # compile time error

But you have a limited set of characters you can put in a prototype, and those exclude identifier characters (the ones we use to make names). If you’ve enabled this experimental feature and Perl sees non-prototype-like characters, it tries signatures instead. Note that Perl has another feature like this: the diamond operator, <>, which might actually be the glob operator if Perl sees glob characters in the argument. You can still use a prototype with signatures, but you probably shouldn’t use prototypes. The perlsub documentation shows you how to use an attribute to supply a prototype.
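
That looks roughly like this; the :prototype attribute has been available since v5.20, and the \@ prototype here is only for illustration:

use v5.20;

# A prototype supplied through the :prototype attribute, leaving the
# parentheses after the name free for a signature if you want one.
sub mymax :prototype(\@) {
    my ($aref) = @_;
    my $max = $aref->[0];
    $_ > $max and $max = $_ for @$aref;
    return $max;
}

my @numbers = ( 3, 7, 2 );
say mymax(@numbers);    # 7; the \@ prototype passes the array by reference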

First, be aware that using a signature does not mess with the normal argument list in @_. You can still play with that yourself.

The simplest signature

The simplest signature is like the simplest prototype. Like prototypes, the signature enforces the number of arguments. To make a constant in Perl you can use a subroutine that takes no arguments. This is essentially what the constant pragma does:

use v5.20;
use feature qw(signatures);
no warnings qw(experimental::signatures);

sub cat () { 'Buster' }

say "Cat is ", cat;

If you try to pass an argument, you’ll get an error but at runtime:

use v5.20;
use feature qw(signatures);
no warnings qw(experimental::signatures);

sub cat () { 'Buster' }
say "Running...";
say "Cat is ", cat( 'Mini');

The first say works, but the second fails when it calls cat incorrectly:

Running...
Too many arguments for subroutine at cat.pl line 7.

A prototype would have raised a compile-time error because the compiler already knows how many arguments there should be. That doesn't mean the experimental signatures won't do that later, but the implementation now is more of a code mangler. Deparsing it (Use B::Deparse to see what perl thinks the code is) shows that the cat subroutine has a die triggered by a check of @_:

$ perl5.20.0 -MO=Deparse cat.pl
sub BEGIN {
    require v5.20;
}
sub cat {
    BEGIN {${^WARNING_BITS} = "\020\001\000\000\000P\004\000\000\000\000\000\000U\005"}
    use strict;
    use feature 'current_sub', 'evalbytes', 'fc', 'say', 'signatures', 'state', 'switch', 'unicode_strings', 'unicode_eval';
    no feature 'array_base';
    die 'Too many arguments for subroutine' unless @_ <= 0;
    ();
    'Buster';
}
BEGIN {${^WARNING_BITS} = "\020\001\000\000\000P\004\000\000\000\000\000\000U\005"}
use strict;
use feature 'current_sub', 'evalbytes', 'fc', 'say', 'signatures', 'state', 'switch', 'unicode_strings', 'unicode_eval';
no feature 'array_base';
say 'Cat is ', cat('Mini');
cat.pl syntax OK

Don't get too hung up on that because it might be a temporary implementation detail. This does mean, however, that you can catch this error with eval:

use v5.20;
use feature qw(signatures);
no warnings qw(experimental::signatures);

sub cat () { 'Buster' }
say "Running...";
say "Cat is ", eval { cat( 'Mini') };

say "Caught error: $@" if $@;

Now we catch the error, but notice that it comes from the line of the subroutine definition, not from the point where you called the subroutine, as it would with croak:

Running...
Cat is 
Caught error: Too many arguments for subroutine at cat.pl line 5.

Mandatory, positional parameters

The meat of this feature is your ability to assign to variables in what many perceive as a prettier way. Instead of declaring variables, usually with my, and performing list operations on @_, you list the variables in the signature in the order you would assign from @_:

use v5.20;
use feature qw(signatures);
no warnings qw(experimental::signatures);

cat( 'Buster' );

sub cat ($cat) { 
	say "The cat is $cat";
	}

Again, this checks the number of parameters. With no arguments or more than one argument you get a runtime error.

You separate multiple arguments with commas, just like a list assignment:

use v5.20;
use feature qw(signatures);
no warnings qw(experimental::signatures);

animals( 'Buster', 'Nikki', 'Godzilla' );

sub animals ($cat, $dog, $lizard) { 
	say "The cat is $cat";
	say "The dog is $dog";
	say "The lizard is $lizard";
	}

These variables are lexical variables in the scope of the subroutine (as you'd see if you deparsed this code). That means you can't use special variables as parameters, which would cause a compile-time error. That is, except for $_, which can be experimentally lexical, a leftover from the v5.10 misadventure with given-when (Perl v5.16 now sets proper magic on lexical $_ and Use for() instead of given()).

Placeholder parameters

You don't have to name all of the parameters. You can use a lone $ to not immediately assign a value, probably because you'll process it yourself through @_. In this example we don't care about the second argument, but the signature still needs the right number of positions, in the right sequence:

use v5.20;
use feature qw(signatures);
no warnings qw(experimental::signatures);

animals( 'Buster', 'Nikki', 'Godzilla' );

sub animals ($cat, $, $lizard) {  # unnamed second parameter
	say "The cat is $cat";
	say "The lizard is $lizard";
	}

This is a bit tricky really. That second argument is mandatory even though you've neglected to give it a name. It will still be in @_ even though you haven't assigned it to a variable.

Slurpy parameters

At the end of the parameter list, you can have a slurpy parameter, which is either a named array or hash. You can do this in a list assignment too, but a list assignment lets you put it in the middle, despite the fact that any succeeding elements get nothing:

my( $cat, $dog, @animals, $lizard ) # $lizard will never get anything
	= qw( Buster Nikki Ginger Mimi Godzilla );

In the subroutine signature, that slurpy thing has to be at the end of the list:

sub animals ( $cat, $dog, @animals ) { # @animals must be at the end
	...;
	}

The rest of the arguments past the second show up in @animals. But here's the interesting thing: the number of elements that show up in the array can be zero.

Sometimes you may not want to completely specify the number of arguments that your subroutine takes, but you also don't want to create a named array. In that case you can use a bare @ as a placeholder to mean that the argument list is unlimited:

sub animals ( $cat, $dog, @ ) { # @ slurps the rest
	...;
	}

The hash can also be a slurpy parameter, and just like the slurpy array it must be at the end of the signature:

sub animals ( $cat, $dog, %args ) { # %args demands an even number
	...;
	}

For the hash, if there isn't an even number of elements left in @_, you'll get a runtime exception. You don't have to name the hash, and a bare % still demands an even number of elements:

sub animals ( $cat, $dog, % ) { # % still demands an even number
	...;
	}

Default values (optional parameters)

Perhaps the best feature of signatures is default values. You can use arbitrary Perl to compute the default value, even if it's just a literal. However, you can only assign default values to optional parameters, which means that they have to appear after the mandatory arguments. In this example, the third argument is optional and gets the default value 'MechaGodzilla' when no third argument is present:

use v5.20;
use feature qw(signatures);
no warnings qw(experimental::signatures);

say "First try-----------";
animals( 'Mimi', 'Nikki', 'Godzilla' );

say "Second try-----------";
animals( 'Buster', 'Nikki', );

sub animals ( $cat, $dog, $lizard = 'MechaGodzilla' ) {
	say "The cat is $cat";
	say "The dog is $dog";
	say "The lizard is $lizard";
	}

On the second try, you get the default value:

First try-----------
The cat is Mimi
The dog is Nikki
The lizard is Godzilla
Second try-----------
The cat is Buster
The dog is Nikki
The lizard is MechaGodzilla

This only checks the number of arguments and assigns the default when the argument list is too short. If you pass undef as an argument, that's the (un)value the parameter will get:

use v5.20;
use feature qw(signatures);
no warnings qw(experimental::signatures);

say "First try-----------";
animals( 'Mimi', 'Nikki', 'Godzilla' );

say "Second try-----------";
animals( 'Buster', 'Nikki', );

say "undef try-----------";
animals( 'Buster', 'Nikki', undef );

sub animals ( $cat, $dog, $lizard = 'MechaGodzilla', $bird = 'Poppy' ) {
	say "The cat is $cat";
	say "The dog is $dog";
	say "The lizard is $lizard";
	say "The bird is $bird";
	}

The undef does not trigger a default value, which may surprise many of you. Notice the third, "undef try" where $lizard gets no value:

First try-----------
The cat is Mimi
The dog is Nikki
The lizard is Godzilla
The bird is Poppy
Second try-----------
The cat is Buster
The dog is Nikki
The lizard is MechaGodzilla
The bird is Poppy
undef try-----------
The cat is Buster
The dog is Nikki
The lizard is 
The bird is Poppy

You can also have a null default value. You might want to have one optional parameter but not assign a value if the argument list isn't long enough. This signature allows one or two arguments, with no defaults:

use v5.20;
use feature qw(signatures);
no warnings qw(experimental::signatures);

say "First try-----------";
animals( 'Mimi', 'Nikki' );

say "Second try-----------";
animals( 'Buster' );

sub animals ( $cat, $= ) {  # second argument is optional
	say "The cat is $cat";
	say "Second argument is $_[1]" if $#_ == 1;
	}

You see that the second argument only exists if you specify it yourself:

First try-----------
The cat is Mimi
Second argument is Nikki
Second try-----------
The cat is Buster

These default values don't work with slurpy types; Perl will complain at compile time. If you want that sort of thing, though, you can make the parameter a scalar and assign an array or hash reference to it:

use v5.20;
use feature qw(signatures);
no warnings qw(experimental::signatures);

animals( 'Buster' );

sub animals ( $cat, $hash={} ) {  # hashref argument is optional
	say "The cat is $cat";
	...;
	}

Fancy optional values

So far your defaults have been simple values, but you can use Perl. That means you can do almost anything. This one uses the value in another variable and increments it as it assigns defaults:

use v5.20;
use feature qw(signatures);
no warnings qw(experimental::signatures);

animals( 'Mimi' );
animals( 'Buster' );

{
my $auto_id = 0;
sub animals ( $cat, $id = ++$auto_id ) {
	say "$id: The cat is $cat";
	}
}

Each cat automatically gets its own sequence value since the animals subroutine closed over $auto_id:

1: The cat is Mimi
2: The cat is Buster

However, you can't do something tricky to bring $auto_id into the subroutine, since the parser doesn't know about the variable soon enough. Neither of these works:

sub animals ( $cat, $id = do { state $auto_id = 0; $auto_id++ } ) {
	say "$auto_id: The cat is $cat";
	}

sub animals ( $cat, $id = $auto_id++ ) {
	state $auto_id++;
	say "$auto_id: The cat is $cat";
	}

You can make the default value a call to a subroutine:

use v5.20;
use feature qw(signatures);
no warnings qw(experimental::signatures);

animals( 'Mimi' );
animals( 'Buster' );

sub animals ( $cat, $auto_id = get_id() ) {
	say "$auto_id: The cat is $cat";
	}

sub get_id () {
	state $id = 0;
	++$id;
	}

Method signatures

My favorite part of signatures is that they work with methods. This doesn't mean that we have multi-dispatch in Perl (yet) (well, Perl 6 does but that's a different language). The signatures aren't any different; they follow all the same rules:

use v5.20;

package Local::Animals {
	use feature qw(signatures);
	no warnings qw(experimental::signatures);

	sub new ( $class, $args = {'name' => 'Buster'} ) {
		bless $args, $class
		}

	sub name ( $self ) {
		$self->{name};
		}
	}

my $cat = Local::Animals->new;
say "Cat is ", $cat->name;

A quick guide

  • Exactly zero arguments: empty signature ()
  • Exactly one named argument: ($cat)
  • Zero or one arguments, with no default, unnamed: ($=)
  • Zero or one arguments, with no default, named: ($cat=)
  • Zero or one arguments, with a default: ($cat='Buster')
  • Exactly two arguments: ($cat, $dog)
  • Two or more arguments, any number: ($cat, $dog, @)
  • Two or more arguments, but an even number: ($cat, $dog, %)
  • Two or more arguments, slurping the rest into an array: ($cat, $dog, @animals)
  • Two or more arguments, slurping the rest into a hash: ($cat, $dog, %animals)
  • Two, three, or four arguments, no defaults: ($cat, $dog, $=, $=)
  • Two, three, or four arguments, one default: ($cat, $dog, $lizard='Godzilla', $=)
  • Class method: ( $class, ... )
  • Object method: ( $self, ... )

Things to Remember

  • Subroutine signatures are experimental
  • Signatures are not prototypes
  • Signatures enforce the number of arguments, but at runtime
  • Default values only apply to optional parameters, and only apply to parameters past the number of actual passed arguments

PAL-Blog: Die Rentenversicherung bringt uns um

News of the death of someone close to you is normally no cause for joy. It gets interesting, though, when that person tells you about their own demise, which has already taken place.

PAL-Blog: Blog-Battle #11: Vergissmeinnicht

Oh Justine, what have you set in motion here. A topic where I knew immediately what I would write about, and knew just as clearly that it would be a very different battle post from the previous ones. This one is not about flowers, sarcasm or pseudo-philosophical musings, but about hard, mean reality.

Perlgeek.de : All Perl 6 modules in a box

Sometimes when we change things in the Perl 6 language or the Rakudo Perl 6 compiler that implements it, we want to know if the planned changes will cause fallout in the library modules out there, and how much.

To get a quick estimate, we can now do a git grep in the experimental perl6-all-modules repository.

This is an attempt to get all the published modules into a single git repository. It is built using git subrepo, an unofficial git extension module that I've been wanting to try for some time, and that seems to have some advantages over submodules in some cases. The notable one here is that git grep ignores submodules, but descends into subrepos just fine.

Here is the use case that made me create this repository: Rakudo accesses low-level operations through the nqp:: pseudo namespace. For example nqp::concat_s('a', 'b') is a low-level way to concatenate two strings. User-level programs can also use nqp:: ops, though it is generally a bad idea, because it ties the program to the particular compiler used; what's more, the nqp:: ops are not part of the public API, and thus are neither documented in the same place as the rest of Perl 6, nor covered by any promises of stability.

So we want to require module authors to use a pragma, use nqp;, in order to make their use of compiler internals explicit and deliberate. And of course, where possible, we want them not to use them at all :-)

To find out how many files in the ecosystem use nqp:: ops, a simple command, combined with the power of the standard UNIX tools, will help:

$ git grep -l 'nqp::'|wc -l
32

That's not too bad, considering we have... how many modules/distributions again?

Since they are added in an author/repo directory structure, counting them with ls and wc isn't hard:

ls -1d */*/|wc -l
282

Ok, but the number of files in relation to the number of distributions isn't really meaningful. So let's ask: how many distributions directly use nqp:: ops?

$ git grep -l nqp:: | cut -d/ -f1,2 |sort -u|wc -l
23

23 out of 282 (or about 8%) distributions use the nqp:: syntax.

By the way, there is a tool (written in Perl 6, of course) to generate and update the repository. It's not perfect yet, very much a work in progress. It lives in the _tools folder, so you should probably filter out that directory in your queries (though in the examples above, it doesn't make a difference).

So, have fun with this new toy!

Dave's Free Press: Journal: Module pre-requisites analyser

Dave's Free Press: Journal: CPANdeps

Dave's Free Press: Journal: Perl isn't dieing

Perlgeek.de : A new Perl 6 community server - update

In my previous post I announced my plans for a new Perl 6 community server (successor to feather.perl6.nl), and now I'd like to share some updates.

Thanks to the generosity of the Perl 6 community, the server has been ordered and paid for. I am now in the process of contacting those donors who haven't paid yet, giving them the choice to re-purpose their pledge to ongoing costs (traffic, public IPv4 addresses, domain(s), SSL certs if necessary) and maintenance, or to withdraw their pledges.

Some details of the hardware we'll get:

  • CPU: Intel® Xeon® Haswell-EP Series Processor E5-2620 v3, 2.40 GHz, 6-Core Socket 2011-3, 15MB Cache
  • RAM: 4x8GB DDR4 PC2133 Reg. ECC 2R
  • HD: 2x 2TB SATA3-HD

The vendor has told me that all parts have arrived, and will be assembled today or tomorrow.

Currently I lean towards using KVM to create three virtual hosts: one for websites (*.perl6.org, perlcabal.syn), one for general hacking and IRC activity, and one for high-risk stuff (evalbots, try.rakudo.org, ...).

I've secured the domain p6c.org (for "perl 6 community"), and the IPv4 range 213.95.82.52 - 213.95.82.62 and the IPv6 net 2001:780:101:ff00::/64.

So the infrastructure is in place, now I'm waiting for the delivery of the hardware.

Perlgeek.de : doc.perl6.org: some stats, future directions

In June 2012 I started the perl6/doc repository with the intent to collect/write API documentation for Perl 6 built-in types and routines. Not long afterwards, the website doc.perl6.org was born, generated from the aforementioned repository.

About 2.5 years later, the repository has seen more than one thousand commits from more than 40 contributors, 14 of whom contributed ten patches or more. The documentation covers about 550 routines in 195 types, plus 15 documents on things other than built-in types (for example an introduction to regexes, and descriptions of how variables work).

In terms of subjective experience, I have observed an increase in the number of questions, on our IRC channel and elsewhere, that could be answered by pointing to the appropriate pages of doc.perl6.org, or by augmenting the answer with a statement like "for more info, see ..."

While it's far from perfect, I think both the numbers and the experience are very encouraging, and I'd like to thank everybody who helped make that happen, often by contributing skills I'm not good at: front-end design, good English, and gentle encouragement.

Plans for the Future

Since this is a community-driven project, I can't plan anybody else's time on it, so these are my own plans for the future of doc.perl6.org.

Infrastructural improvements

There are several unsolved problems with the web interface, with how we store our documents, and how information can be found. I plan to address them slowly but steadily.

  • The search is too much centered around types and routines; searching for variables, syntactic constructs and keywords isn't easily possible. I want it to find many more things than it does right now.
  • Currently we store the docs for each type in a separate file called Type.pod. That will break when we start to document native types, which begin with lower-case letters. Having both int.pod and Int.pod is completely unworkable on case-insensitive or case-preserving file systems. I want to come up with a solution for that, though I don't yet know what it will look like.
  • doc.perl6.org is served from static pages, which leads to some problems with file names conflicting with UNIX conventions. You can't name a file infix:</>.html, and files with two consecutive dots in their names are also weird. So in the long run, we'll have to switch to some kind of dynamic URL dispatching, or a name escaping scheme that is capable of handling all of Perl 6's syntax.
  • Things like the list of methods and what they coerce to in class Cool don't show up in derived types; either the tooling needs to be improved for that, or they need to be rewritten to use the usual one-heading-per-method approach.

Content

Of course my plan is to improve coverage of the built-in types and routines, and add more examples. In addition, I want to improve and expand on the language documentation (for example syntax, OO, regexes, MOP), ideally documenting every Perl 6 feature.

Once the language features are covered in sufficient breadth and depth (though I won't wait for 100% coverage), I want to add three tutorial tracks:

  • A track for beginners
  • A quick-start for programmers from other languages
  • A series of intermediate to advanced guides covering topics such as parsing, how to structure a bigger application, the responsible use of meta programming, or reactive programming.

Of course I won't be able to do that all on my own, so I hope to convince my fellow and future contributors that those are good ideas.

Time to stop rambling about the future and get to writing some docs. This is yours truly, signing off.

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 3

Perlgeek.de : The Fun of Running a Public Web Service, and Session Storage

One of my websites, Sudokugarden, recently surged in traffic, from about 30k visitors per month to more than 100k visitors per month. Here's the tale of what that meant for the server side.

As a bit of background, I built the website in 2007, when I knew a lot less about the web and programming. It runs on a host that I share with a few friends; I don't have root access on that machine, though when the admin is available, I can generally ask him to install stuff for me.

Most parts of the website are built as static HTML files, with Server Side Includes. Parts of those SSIs are Perl CGI scripts. The most popular part though, which allows you to solve Sudokus in the browser and keeps high scores, is written as a collection of Perl scripts, backed by a MySQL database.

When, at peak times, the site had more than 10k visitors a day, lots of visitors would get a nasty mysql: Cannot connect: Too many open connections error. The admin wasn't available to bump the connection limit, so I looked for other solutions.

My first action was to check the logs for spammers and crawlers that might have hammered the page, and I found and banned some; but the bulk of the traffic looked completely legitimate, and the problem persisted.

Looking at the seven-year-old code, I realized that most pages didn't actually need a database connection, if only I could move the session storage out of the database. And, in fact, I could. I used CGI::Session, which has pluggable backends. Switching to a file-based session backend was just a matter of changing the connection string and adding a directory for session storage. Luckily the code was clean enough that this only affected a single subroutine. Everything was fine.
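
The change was roughly of this shape (a sketch with made-up paths and credentials, not the actual site code):

use strict;
use warnings;
use CGI;
use CGI::Session;

my $cgi = CGI->new;

# Before: sessions stored in MySQL, costing a database connection per request.
# my $session = CGI::Session->new( 'driver:mysql', $cgi,
#     { DataSource => 'dbi:mysql:sudoku', User => 'web', Password => 'secret' } );

# After: file-based sessions, no database connection needed at all.
my $session = CGI::Session->new( 'driver:file', $cgi,
    { Directory => '/var/tmp/sudoku-sessions' } );

print $session->header;              # re-sends the session cookie
$session->param( visited => time );  # store something in the session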

For a while.

Then, about a month later, the host ran out of free disk space. Since it is used for other stuff too (like email, and web hosting for other users), it took me a while to make the connection to the file-based session storage. What had happened was 3 million session files on an ext3 file system with a block size of 4 kilobytes. A session is only about 400 bytes, but since a file uses up a multiple of the block size, the session storage amounted to 12 gigabytes of used-up disk space, which was all that was left on that machine.

Deleting those sessions turned out to be a problem; I could only log in as my own user, which doesn't have write access to the session files (which are owned by www-data, the Apache user). The solution was to upload a CGI script that deleted the sessions, but of course that wasn't possible at first, because the disk was full. In the end I had to delete several gigabytes of data from my home directory before I could upload anything again. (Processes running as root were still writing to reserved-to-root portions of the file system, which is why I had to delete so much data before I was able to write again).

Even when I was able to upload the deletion script, it took quite some time to actually delete the session files; mostly because the directory was too large, and deleting files on ext3 is slow. When the files were gone, the empty session directory still used up 200MB of disk space, because the directory index doesn't shrink on file deletion.

Clearly a better solution to session storage was needed. But first I investigated where all those sessions came from, and banned a few spamming IPs. I also changed the code to only create sessions when somebody logs in, rather than giving every visitor a session from the start.
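
The "no session until login" part can be done with CGI::Session's load(), which retrieves an existing session without creating a new one. A rough sketch (the 'login' parameter check is just a stand-in for whatever the real scripts do):

use CGI;
use CGI::Session;

my $cgi = CGI->new;

# load() retrieves an existing session, but never creates one
my $session = CGI::Session->load($cgi) or die CGI::Session->errstr;

if ($session->is_empty && defined $cgi->param('login')) {
    # only an actual login attempt gets promoted to a stored session
    $session = $session->new;
}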

My next attempt was to write the sessions to an SQLite database. It uses about 400 bytes per session (plus a fixed overhead for the db file itself), so it uses only a tenth of the storage space that the file-based storage used. The SQLite database has no connection limit, though the old-ish version that was installed on the server doesn't seem to have very fine-grained locking either; within a few days I saw errors that the session database was locked.

So I added another layer of workaround: creating a separate session database per leading IP octet. So now there are up to 255 separate session databases (plus a 256th for all IPv6 addresses; a decision that will have to be revised when IPv6 usage rises). After a few days of operation, it seems that this setup works well enough. But suspicious as I am, I'll continue monitoring both disk usage and errors from Apache.
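
Roughly, the per-octet scheme looks like this (a sketch with made-up paths; the exact driver arguments should be checked against CGI::Session::Driver::sqlite):

use CGI;
use CGI::Session;

my $cgi = CGI->new;
my $ip  = $ENV{REMOTE_ADDR} // '';

# IPv4 clients get one database per leading octet, IPv6 clients share one
my $bucket  = $ip =~ /^(\d{1,3})\./ ? $1 : 'ipv6';
my $db_file = "/path/to/sessions/sessions-$bucket.sqlite";

# driver name and argument as documented for the sqlite backend; adjust as needed
my $session = CGI::Session->new('driver:sqlite', $cgi,
    { DataSource => $db_file });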

So, what happens if this solution fails to work out? I can see basically two approaches: move the site to a server that's fully under my control, and use redis or memcached for session storage; or implement sessions with signed cookies that are stored purely on the client side.

Ocean of Awareness: PEG: Ambiguity, precision and confusion

Precise?

PEG parsing is a new notation for a notoriously tricky algorithm that goes back to the earliest computers. In its PEG form, this algorithm acquired a seductive new interface, one that looks like the best of extended BNF combined with the best of regular expressions. Looking at a sample of it, you are tempted to imagine that writing a parser has suddenly become a very straightforward matter. Not so.

For those not yet in the know on this, I'll illustrate with a pair of examples from an excellent 2008 paper by Redziejowski. Let's start with these two PEG specifications.

    ("a"|"aa")"a"
    ("aa"|"a")"a"
    

One of these two PEG grammars accepts the string "aaa" but not the string "aa". The other does the opposite -- it accepts the string "aa" but not the string "aaa". Can you tell which one? (For the answer, see page 4 of Redziejowski 2008.)

Here is another example:

    A = "a"A"a"/"aa"
    

What language does this describe? All the strings in the language are obviously the letter "a", repeated some number of times. But which string lengths are in the language, and which are not? Again the answer is on page 4 of Redziejowski 2008 -- it's exactly those strings whose length is a power of 2.

With PEG, what you see in the extended BNF is not what you get. PEG parsing has been called "precise", apparently based on the idea that PEG parsing is in a certain sense unambiguous. In this case "precise" is taken as synonymous with "unique". That is, PEG parsing is precise in exactly the same sense that Jimmy Hoffa's body is at a precise location. There is (presumably) exactly one such place, but we are hard put to be any more specific about the matter.

Syntax-driven?

The advantage of using a syntax-driven parser generator is that the syntax you specify describes the language that will be parsed. For most practical grammars, PEG is not syntax-driven in this sense. Several important PEG researchers understand this issue, and have tried to deal with it. I will talk about their work below. There is much more at stake here than bragging rights over which algorithm is really syntax-driven and which is not.

When you do not know the language your parser is parsing, you of course have the problem that your parser might not parse all the strings in your language. That can be dealt with by fixing the parser to accept the correct input, as you encounter problems.

A second, more serious, problem is often forgotten. Your PEG parser might accept strings that are not in your language. At worst, this creates a security loophole. At best, it leaves you with a choice: break compatibility, or leave the problem unfixed.

It's important to be able to convince yourself that your code is correct by examining it and thinking about it. Beginning programmers often simply hack things, and call code complete once it passes the test suite. Test suites don't catch everything, but there is a worse problem with the beginner's approach.

Since the beginner has no clear idea of why his code works, even when it does, it is unlikely to be well-organized or readable. Programming techniques like PEG, where the code can be made to work, but where it is much harder, and in practice usually not possible, to be sure why the code works, become maintenance nightmares.

The maintenance implications are especially worrisome if the PEG parser is for a language with a life cycle that may involve bug fixes or other changes. The impact of even small changes to a PEG specification is hard to predict and hard to discover after the fact.

Is PEG unambiguous?

PEG is not unambiguous in any helpful sense of that word. BNF allows you to specify ambiguous grammars, and that feature is tied to its power and flexibility and often useful in itself. PEG will only deliver one of those parses. But without an easy way of knowing which parse, the underlying ambiguity is not addressed -- it is just ignored.

My Marpa parser is a general BNF parser based on Earley's. It also can simply throw away all but one of the parses in an ambiguous parse. But I would not feel justified in saying to a user who has an issue with ambiguity that Marpa has solved her problem by throwing away all but one arbitrarily chosen result.

Sticking with Marpa for a moment, we can see one example of a more helpful approach to ambiguity. Marpa allows a user to rank rules, so that all but the highest ranking rules are not used in a parse. Marpa's rule rankings are specified in its BNF, and they work together with the BNF in an intuitive way. In every case, Marpa delivers precisely the parses its BNF and its rule rankings specify. And it is "precision" in this sense that a parser writer is looking for.

Is there a sensible way to use PEG?

I'll return to Marpa at the end of this post. For now, let's assume that you are not interested in using Marpa -- you are committed to PEG, and you want to make the best of PEG. Several excellent programmers have focused on PEG, without blinding themselves to its limitations. I've already mentioned one important paper by Redziejowski. Many of Redziejowski's collected papers are about PEG, and Redziejowski, in his attempts to use PEG, does not sugarcoat its problems.

Roberto Ierusalimschy, author of Lua and one of the best programmers of our time, has written a PEG-based parser of his own. Roberto is fully aware of PEG's limits, but he makes a very good case for choosing PEG as the basis of LPEG, his parser generator. LPEG is intended for use with Lua, a ruthlessly minimal language. Roberto's minimalist implementation limits the power of his parser, but his aim is to extend regular expressions in a disciplined way, and a compact parser of limited power is quite acceptable for his purposes.

Matching the BNF to the PEG spec

As Redziejowski and Ierusalimschy and the other authors of Mascarenhas et al, 2013 recognize, not knowing what language you are parsing is more than an annoyance. We can call a language "well-behaved for PEG" if the PEG spec delivers exactly the language the BNF describes.

Which languages are well-behaved for PEG? According to Mascarenhas et al, 2013, the LL(1) languages are well-behaved. (The LL(1) languages are the languages a top-down parser can parse based on at most one character of input.) Syntax-driven parsers for LL(1) have been around for much longer than PEG -- one such parser is described in the first paper to describe recursive descent (Peter Lucas, 1961). But most practical languages are not LL(1). Redziejowski 2013 and Redziejowski 2014 seek to extend this result by defining the language class LL(1p) -- those top-down languages with one "parsing procedure" of lookahead. The LL(1p) languages are also well-behaved for PEG.

Mascarenhas et al, 2013 also look at a different approach -- instead of writing a PEG specification and trying to keep it well-behaved, they look at taking languages from larger top-down classes and translating them to PEG. I don't know of any followup, but it's possible this approach could produce well-behaved top-down parsers which are an improvement over direct-from-PEG parsing. But for those who are open to leaving top-down parsing behind, a parser which handles languages in all these classes and more is already available.

Marpa

In this post, I have adopted the point of view of programmers using PEG, or thinking of doing so. My own belief in this matter is that very few programmers should want to bother with the issues I've just described. My reason for this is the Marpa parser -- a general BNF Earley-driven parser that

  • has an implementation you can use today;
  • allows the application to combine syntax-driven parsing with custom procedural logic;
  • makes available full, left-eidetic knowledge of the parse to the procedural logic;
  • and parses a vast class of grammars in linear time, including all the LR-regular grammars.

The LR-regular grammars include the LR(k) and LL(k) grammars for all k. LR-regular includes all the languages which are well-behaved under PEG, and all of those that Mascarenhas et al, 2013 consider translating into PEG.
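
To give a concrete sense of what "syntax-driven" means in practice, here is a minimal sketch using Marpa::R2's SLIF interface. (The grammar and the input are made up for this post, not taken from the Marpa distribution; what matters is that the BNF you write is exactly the language you get.)

    use Marpa::R2;

    my $dsl = q{
        :default ::= action => ::array
        Expr ::= Expr '+' Term | Term
        Term ::= number
        number     ~ [\d]+
        :discard   ~ whitespace
        whitespace ~ [\s]+
    };

    my $grammar = Marpa::R2::Scanless::G->new({ source => \$dsl });
    my $recce   = Marpa::R2::Scanless::R->new({ grammar => $grammar });

    my $input = '1 + 2 + 3';
    $recce->read(\$input);
    my $value_ref = $recce->value;   # the parse the BNF above describes, nothing more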

Comments

Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net. To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site.

Perlgeek.de : CPAN Pull Request Challenge: A call to the CPAN authors

The 2015 CPAN Pull Request Challenge is ramping up, and so far nearly two hundred volunteers have signed up, pledging to make one pull request for a CPAN distribution for each month of the year.

So here's a call to all the CPAN authors: please be supportive, and if you don't want your CPAN distributions to be part of the challenge, please send an email to neil at bowers dot com, stating your PAUSE ID and the fact that you want to be excluded.

How to be supportive? The first step is to act on pull requests. If you don't have time for a review, please say so; getting some response, even if it's "it'll be some time till I get around to reviewing this", is much better than none.

The volunteers have varied backgrounds; some are seasoned veterans, others are beginners who will make their first contribution to Open Source. So please be patient and encouraging.

If you have specific requirements for contributions, add a file called CONTRIBUTING or CONTRIBUTING.md to your github repositories where you state those requirements.

And of course, be civil. But that goes without saying, right? :-)

(And to those CPAN authors who haven't done it yet: put your distributions on github, so that you're not left out of the fun!)

Happy New Year everybody, and have a great deal of fun!

See also: Resources for the CPAN Pull Request Challenge.

Dave's Free Press: Journal: Devel::CheckLib can now check libraries' contents

Perlgeek.de : Rakudo's Abstract Syntax Tree

After or while a compiler parses a program, the compiler usually translates the source code into a tree format called an Abstract Syntax Tree, or AST for short.

The optimizer works on this program representation, and then the code generation stage turns it into a format that the platform underneath it can understand. Actually I wanted to write about the optimizer, but noticed that understanding the AST is crucial to understanding the optimizer, so let's talk about the AST first.

The Rakudo Perl 6 Compiler uses an AST format called QAST. QAST nodes derive from the common superclass QAST::Node, which sets up the basic structure of all QAST classes. Each QAST node has a list of child nodes, possibly a hash map for unstructured annotations, an attribute (confusingly) named node for storing the lower-level parse tree (which is used to extract line numbers and context), and a bit of extra infrastructure.

The most important node classes are the following:

QAST::Stmts
A list of statements. Each child of the node is considered a separate statement.
QAST::Op
A single operation that usually maps to a primitive operation of the underlying platform, like adding two integers, or calling a routine.
QAST::IVal, QAST::NVal, QAST::SVal
Those hold integer, float ("numeric") and string constants respectively.
QAST::WVal
Holds a reference to a more complex object (for example a class) which is serialized separately.
QAST::Block
A list of statements that introduces a separate lexical scope.
QAST::Var
A variable.
QAST::Want
A node that can evaluate to different child nodes, depending on the context it is compiled in.

To give you a bit of a feel for how those node types interact, I want to show a few examples of Perl 6 code, and the ASTs they could produce. (It turns out that Perl 6 is quite a complex language under the hood, and usually produces a more complicated AST than the obvious one; I'll ignore that for now, in order to introduce you to the basics.)

Ops and Constants

The expression 23 + 42 could, in the simplest case, produce this AST:

QAST::Op.new(
    :op('add'),
    QAST::IVal.new(:value(23)),
    QAST::IVal.new(:value(42)),
);

Here a QAST::Op encodes a primitive operation, an addition of two numbers. The :op argument specifies which operation to use. The child nodes are two constants, both of type QAST::IVal, which hold the operands of the low-level operation add.

Now the low-level add operation is not polymorphic, it always adds two floating-point values, and the result is a floating-point value again. Since the arguments are integers and not floating point values, they are automatically converted to float first. That's not the desired semantics for Perl 6; actually the operator + is implemented as a subroutine of name &infix:<+>, so the real generated code is closer to

QAST::Op.new(
    :op('call'),
    :name('&infix:<+>'),    # name of the subroutine to call
    QAST::IVal.new(:value(23)),
    QAST::IVal.new(:value(42)),
);

Variables and Blocks

Using a variable is as simple as writing QAST::Var.new(:name('name-of-the-variable')), but it must be declared first. This is done with QAST::Var.new(:name('name-of-the-variable'), :decl('var'), :scope('lexical')).

But there is a slight caveat: in Perl 6 a variable is always scoped to a block. So while you can't ordinarily mention a variable prior to its declaration, there are indirect ways to achieve that (lookup by name, and eval(), to name just two).

So in Rakudo there is a convention to create QAST::Block nodes with two QAST::Stmts children. The first holds all the declarations, and the second all the actual code. That way all the declarations always come before the rest of the code.

So my $x = 42; say $x compiles to roughly this:

QAST::Block.new(
    QAST::Stmts.new(
        QAST::Var.new(:name('$x'), :decl('var'), :scope('lexical')),
    ),
    QAST::Stmts.new(
        QAST::Op.new(
            :op('p6store'),
            QAST::Var.new(:name('$x')),
            QAST::IVal.new(:value(42)),
        ),
        QAST::Op.new(
            :op('call'),
            :name('&say'),
            QAST::Var.new(:name('$x')),
        ),
    ),
);

Polymorphism and QAST::Want

Perl 6 distinguishes between native types and reference types. Native types are closer to the machine, and their type name is always lower case in Perl 6.

Integer literals are polymorphic in that they can be either a native int or a "boxed" reference type Int.

To model this in the AST, QAST::Want nodes can contain multiple child nodes. The compile-time context decides which of those is actually used.

So the integer literal 42 actually produces not just a simple QAST::IVal node but rather this:

QAST::Want.new(
    QAST::WVal(Int.new(42)),
    'Ii',
    QAST::IVal(42),
)

(Note that Int.new(42) is just a nice notation to indicate a boxed integer object; it doesn't quite work like this in the code that translates Perl 6 source code into ASTs).

The first child of a QAST::Want node is the one used by default, if no other alternative matches. Then comes a list where the elements at odd indexes are format specifications (here Ii for integers) and the elements at even indexes are the ASTs to use in those cases.

An interesting format specification is 'v' for void context, which is always chosen when the return value from the current expression isn't used at all. In Perl 6 this is used to eagerly evaluate lazy lists that are used in void context, and for several optimizations.

Dave's Free Press: Journal: I Love Github

Dave's Free Press: Journal: Palm Treo call db module

Ocean of Awareness: Removing obsolete versions of Marpa from CPAN

Marpa::XS, Marpa::PP, and Marpa::HTML are obsolete versions of Marpa, which I have been keeping on CPAN for the convenience of legacy users. All new users should look only at Marpa::R2.

I plan to delete the obsolete releases from CPAN soon. For legacy users who need copies, they will still be available on backPAN.

I do this because their presence on CPAN makes them "attractive nuisances" -- they show up in searches and generally make it harder to find Marpa::R2, which is the version that new users should be interested in. There is also some danger that a new user could, by mistake, use the obsolete versions instead of Marpa::R2.

It's been some time since someone has reported a bug in their code, so they should be stable for legacy applications. I would usually promise to fix serious bugs that affect legacy users, but unfortunately, especially in the case of Marpa::XS, it is a promise I would have trouble keeping. Marpa::XS depends on Glib, and uses a complex build which I last performed on a machine I no longer use for development.

For this reason, a re-release to CPAN with deprecatory language is also not an option. I probably would not want to do so anyway -- the CPAN infrastructure by default pushes legacy users into upgrading, which always carries some risk. New deprecatory language would add no value for the legacy users, and they are the only audience these releases exist to serve.

Comments

Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net. To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site.

Perlgeek.de : A new Perl 6 community server - call for funding

So far, many Perl 6 developers have used feather as a generic development server. Juerd, who has generously provided this server for us for free for many years, has announced that it will be shut down at the end of the year.

My daytime job is at a b2b IT outsourcing and hosting company called noris network, and they have agreed to sponsor the hosting/housing of a 1U 19" server in one of their state-of-the-art data centers in Nürnberg, Germany.

What's missing is the actual hardware. Some folks in the community have already agreed to participate in funding the hardware, though I have few concrete pledges.

So here is the call to action: if you want to help the Perl 6 community with a one-time donation towards a new community server, please send me an e-mail to moritz at faui2k3 dot org, specifying the amount you're willing to pledge, and whether you want to stay private as a donor. I accept money transfers by PayPal and wire transfer (SWIFT). Direct hardware donations are also welcome. (Though the actual money transfer will be deferred until the final decision on what hardware to buy, and thus on the total amount required, has been made.)

How much do we need?

Decent, used 1U servers seem to start at about 250€, though 350€ would get us a lot more bang (mostly RAM and hard disk space). And in general, the more the merrier. (Cheaper offers exist, for example on ebay, but usually they are without hard disks, so the need for extra drives makes them more expensive in total).

With more money, even beefier hardware and/or spare parts and/or a maintenance contract and/or new hardware would be an option.

What do we need it for?

The main tasks for the server are:

  • Hosting websites like perl6.org and the synopses
  • Hosting infrastructure like the panda metadata server
  • Be available for smoke runs of the compilers, star distributions and module ecosystem.
  • Be available as a general development machine for people who don't have linux available and/or not enough resources to build some Perl 6 compilers on their own machines comfortably.
  • A place for IRC sessions for community members
  • A backup location for community services like the IRC logs, the camelia IRC eval bot etc. Those resources are currently hosted elsewhere, though having another option for hosting would be very valuable.
  • A webspace for people who want to host Perl 6-related material.
  • It is explicitly not meant as a general hosting platform, nor as a mail server.

Configuration

If the hardware we get is beefy enough, I'd like to virtualize the server into two to three components. One for hosting the perl6.org and related websites that should be rather stable, and one for the rest of the system. If resources allow it, and depending on feedback I get, maybe a third virtual system for high-risk stuff like evalbot.

As operating system I'll install Debian Jessie (the current testing), simply because I'll end up maintaining the system, and it's the system I'm most familiar with.

Perlgeek.de : Profiling Perl 6 code on IRC

On the #perl6 IRC channel, we have a bot called camelia that executes small snippets of Perl 6 code, and prints the output that it produces. This is a pretty central part of our culture, and we use it to explain or demonstrate features or even bugs in the compiler.

Here is an example:

10:35 < Kristien> Can a class contain classes?
10:35 < Kristien> m: class A { class B { } }; say A.new.B.new
10:35 <+camelia> rakudo-moar 114659: OUTPUT«No such method 'B' for invocant of 
                 type 'A'␤  in block <unit> at /tmp/g81K8fr9eY:1␤␤»
10:35 < Kristien> :(
10:36 < raydiak> m: class A { class B { } }; say A::B.new
10:36 <+camelia> rakudo-moar 114659: OUTPUT«B.new()␤»

Yesterday and today I spent some time teaching this IRC bot to not only run the code, but optionally also run it through a profiler, to make it possible to determine where the virtual machine spends its time running the code. An example:

12:21 < moritz> prof-m: Date.today for ^100; say "done"
12:21 <+camelia> prof-m 9fc66c: OUTPUT«done␤»
12:21 <+camelia> .. Prof: http://p.p6c.org/453bbe

The Rakudo Perl 6 compiler on the MoarVM backend has a profiler, which produces a fancy HTML + JavaScript page, and this is what the bot runs. The result is automatically uploaded to a webserver, producing this profile.

Under the hood, it started with a patch that makes it possible to specify the output filename for a profile run, and another one to clear up the fallout from the previous patch.

Then came the bigger part: setting up the Apache virtual host that serves the web files, including a restricted user that only allows up- and downloads via scp. Since the IRC bot can execute arbitrary code, it is very likely that an attacker can steal the private SSH keys used for authentication against the webserver. So it is essential that if those keys are stolen, the attacker can't do much more than uploading more files.

I used rssh for this. It is the login shell for the upload user, and configured to only allow scp. Since I didn't want the attacker to be able to modify the authorized_keys file, I configured rssh to use a chroot below the home directory (which sadly in turn requires a setuid-root wrapper around chroot, because ordinary users can't execute it. Well, nothing is perfect).

Some more patching and debugging later, the bot was ready.

The whole thing feels a bit bolted on; if usage warrants it, I'll see if I can make the code a bit prettier.

Perlgeek.de : YAPC Europe 2013 Day 2

The second day of YAPC Europe was enjoyable and informative.

I learned about ZeroMQ, which is a bit like sockets on steroids. Interesting stuff. Sadly Design decisions on p2 didn't quite qualify as interesting.

Matt's PSGI archive is a project to rewrite Matt's infamous script archive in modern Perl. Very promising, and a bit entertaining too.

Lunch was very tasty, more so than the usual mass catering. Kudos to the organizers!

After lunch, jnthn talked about concurrency, parallelism and asynchrony in Perl 6. It was a great talk, backed by great work on the compiler and runtime. Jonathan's talks are always to be recommended.

I think I didn't screw up my own talk too badly, at least the timing worked fine. I just forgot to show the last slide. No real harm done.

I also enjoyed mst's State of the Velociraptor, which was a summary of what went on in the Perl world in the last year. (Much better than the YAPC::EU 2010 talk with the same title).

The Lightning talks were as enjoyable as those from the previous day. So all fine!

Next up is the river cruise, I hope to blog about that later on.

Perlgeek.de : The REPL trick

A recent discussion on IRC prompted me to share a small but neat trick with you.

If there are things you want to do quite often in the Rakudo REPL (the interactive "Read-Evaluate-Print Loop"), it makes sense to create a shortcut for them. And creating shortcuts for often-used stuff is what programming languages excel at, so you do it right in a Perl module:

use v6;
module REPLHelper;

sub p(Mu \x) is export {
    x.^mro.map: *.^name;
}

I have placed mine in $HOME/.perl6/repl.

And then you make sure it's loaded automatically:

$ alias p6repl="perl6 -I$HOME/.perl6/repl/ -MREPLHelper"
$ p6repl
> p Int
Int Cool Any Mu
>

Now you have a neat one-letter function which tells you the parents of an object or a type, in method resolution order. And a way to add more shortcuts when you need them.

Ocean of Awareness: Linear? Yeah right.

Linear?

I have claimed that my new parser, Marpa, is linear for vast classes of grammars, going well beyond what the traditional parsers can do. But skepticism is justified. When it comes to parsing algorithms, there have been a lot of time complexity claims that are hand-wavy, misleading or just plain false. This post describes how someone, who is exercising the appropriate degree of skepticism, might conclude that believing Marpa's claims is a reasonable and prudent thing to do.

Marpa's linearity claims seem to be, in comparison with the other parsers in practical use today, bold. Marpa claims linearity, not just for every class of grammar for which yacc/bison, PEG and recursive descent currently claim linearity, but for considerably more. (The mathematical details of these claims are in a section at the end.) It seems too good to be true.

Why should I believe you?

The most important thing to realize, in assessing the believability of Marpa's time complexity claims, is that they are not new. They were already proved in a long-accepted paper in the refereed literature. They are the time complexity claims proved by Joop Leo for his algorithm in 1991, over two decades ago. Marpa is derived from Leo's algorithm, and its time complexity claims are those proved for Leo's algorithm.

Above I said that Marpa's time complexity claims "seem" bold. On any objective assessment, they are in fact a bit of a yawn. The claims seem surprising only because a lot of people are unaware of Leo's results. That is, they are surprising in the same sense that someone who had avoided hearing about radio waves would be surprised to learn that he can communicate instantly with someone on the other side of the world.

So, if there's so little to prove, why does the Marpa paper have proofs? In Marpa, I made many implementation decisions about, and some changes to, the Leo/Earley algorithm. None of my changes produced better time complexity results -- my only claim is that I did not change the Leo/Earley algorithm in a way that slowed it down. To convince myself of this claim, I reworked the original proofs of Leo and Earley, changing them to reflect my changes, and demonstrated that the results that Leo had obtained still held.

Proofs of this kind, which introduce no new mathematical techniques, but simply take a previous result and march from here to there by well-known means, are called "tedious". In journals, where there's a need to conserve space, they are usually omitted, especially if, as is the case with Marpa's time complexity proofs, the results are intuitively quite plausible.

Getting from plausible to near-certain

So let's say you are not going to work through every line of Marpa's admittedly tedious proofs. We've seen that the results are intuitively plausible, as long as you don't reject the previous literature. But can we do better than merely "plausible"?

As an aside, many people misunderstand the phrase "mathematically proven", especially as it applies to branches of math like parsing theory. The fact is that proofs in papers often contain errors. Usually these are minor, and don't affect the result. On the other hand, Jay Earley's paper, while one of the best Computer Science papers ever published, also contained a very serious error. And this error slipped past his Ph.D. committee and his referees. Mathematical arguments and proofs do not allow us to achieve absolute certainty. They can only improve the degree of certainty.

There's a second way to dramatically increase your degree of conviction in Marpa's linearity claims, and it is quite simple. Create examples of problematic grammars, run them and time them. This is not as satisfying as a mathematical proof, because no set of test grammars can be exhaustive. But if you can't find a counter-example to Marpa's linearity claims among the grammars of most interest to you, that should help lift your level of certainty to "certain for all practical purposes".
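
Here is the sort of harness I have in mind -- a rough sketch, with an arbitrary right-recursive grammar and arbitrary input sizes; the timings are only meant to show the trend (roughly constant time per doubled input would indicate quadratic growth, roughly doubled time indicates linear):

    use Marpa::R2;
    use Time::HiRes qw(time);

    # right recursion: roughly quadratic for a naive Earley parser,
    # linear for Marpa thanks to Leo's memoization
    my $dsl = q{
        S ::= 'a' S
        S ::= 'a'
    };
    my $grammar = Marpa::R2::Scanless::G->new({ source => \$dsl });

    for my $n (1_000, 2_000, 4_000, 8_000) {
        my $input = 'a' x $n;
        my $start = time;
        my $recce = Marpa::R2::Scanless::R->new({ grammar => $grammar });
        $recce->read(\$input);
        printf "length %5d: %.3f seconds\n", $n, time() - $start;
    }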

Much of this increase in certainty can be achieved without bothering to run your own tests. Marpa is in wide use at this point. If Marpa was going quadratic on grammars for which it claimed to be linear, and these were grammars of practical interest, that would very likely have been noticed by now.

I'm still not convinced

Let's suppose all this has not brought you to the level of certainty you need to use Marpa. That means the reasonable thing is to continue to struggle to work with the restrictions of the traditional algorithms, right? No, absolutely not.

OK, so you don't believe that Marpa preserves the advances in power and speed made by Leo. Does that mean that parsers have to stay underpowered? No, it simply means that there should be a more direct implementation of Leo's 1991 algorithm, bypassing Marpa.

But if you are looking for an implementation of Leo's 1991 algorithm, I think you may end up coming back to Marpa as the most reasonable choice. Marpa's additional features include the ability to use custom, procedural logic, as you can with recursive descent. And Marpa has worked out a lot of the implementation details for you.

Comments

Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net. To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site.

Appendix: Some technical details

Above I talked about algorithms, classes of grammars and their linearity claims. I didn't give details because most folks aren't interested. For those who are, they are in this section.

yacc is linear for a grammar class called LALR, which is a subset of another grammar class called LR(1). If you are willing to hassle with GLR, bison claims linearity for all of LR(1). Recursive descent is a technique, not an algorithm, but it is top-down with look-ahead, and therefore can be seen as some form of LL(k), where k depends on how it is implemented. In practice, I suspect k is never much bigger than 3, and usually pretty close to 1. With packratting, PEG can be made linear for everything it parses but there is a catch -- only in limited cases do you know what language your PEG grammar actually parses. In current practice, that means your PEG grammar must be LL(1). Some of the PEG literature looks at techniques for extending this as far as LL-regular, but there are no implementations, and it remains to be seen if the algorithms described are practical.

The Marpa paper contains a proof, based on a proof of the same claim by Joop Leo, that Marpa is linear for LR-regular grammars. The LR-regular grammars include the LR(k) grammars for every k. So Marpa is linear for LR(1), LR(2), LR(8675309), etc. LR-regular also includes LL-regular. So every class of grammar under discussion in the PEG literature is already parsed in linear time by Marpa. From this, it is also safe to conclude that, if a grammar can be parsed by anything reasonably described as recursive descent, it can be parsed in linear time by Marpa.

Perlgeek.de : New Perl 6 community server now live, accepting signups

The new Perl 6 community server is now alive and kicking.

As planned, I've set up KVM virtualization, and so far there are two guest systems. hack.p6c.org is meant for general Perl 6 development activity (which also includes irssi/weechat sessions), and is equipped with 20GB RAM to handle multiple concurrent rakudo-jvm compilations :-). It runs a pretty bare-bones Debian Jessie.

Update: there is now a website for the new server.

www.p6c.org is the web server where I plan to host perl6.org and related (sub-)domains. It's not as beefy as hack, but sufficiently large to compile and run Rakudo, in preparation for future Perl 6-based web hosting. Currently I'm running a copy of several perl6.org subdomains on it (with the domain name p6c instead of perl6 for test purposes); the plan is to switch the perl6.org DNS over once all of the websites have been copied/migrated.

If you have a Perl 6 related use for a shell account or for serving websites, please request an account by email (moritz.lenz@gmail.com) or IRC (moritz on freenode and magnet), including:

  1. Your desired username
  2. What you want to do on the machine(s) (not necessary for #perl6 regulars)
  3. Which of the machine(s) you need access to
  4. Optionally an openssh public key
  5. Whether you'd be willing to help a bit with sysadmin tasks (mostly apt-get update && apt-get dist-upgrade, restarting hung services, killing huge processes)
  6. Software you need installed (it's OK to not know this up-front)

Note that feather.perl6.nl will shut down soon (no fixed date yet, but "end of 2014" is expected), so if you rely on feather now, you should consider migrating to the new server.

The code of conduct is pretty simple:

  1. Be reasonable in your resource usage.
  2. Use technical means to limit your resource usage so that it doesn't accidentally explode (ulimit comes to mind).
  3. Limit yourself to legal and Perl 6-related use cases (no warez).
  4. Help your fellow hackers.

The standard disclaimer applies:

  • Expect no privacy. There will potentially be many root users, who could all read your files and memory.
  • There are no promises of continued service or even support. Your account can be terminated without notice.
  • Place of jurisdiction in Nürnberg, Germany. You have to comply with German law while using the server. (Note that this puts pretty high standards on privacy for any user data you collect, including from web applications). It's your duty to inform yourself about the applicable laws. Illegal activities will be reported to the authorities.

With all that said, happy hacking!

Dave's Free Press: Journal: Graphing tool

Dave's Free Press: Journal: Travelling in time: the CP2000AN

Dave's Free Press: Journal: XML::Tiny released

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 1

Dave's Free Press: Journal: Thanks, Yahoo!

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 2

Ocean of Awareness: Parsing: Top-down versus bottom-up

Comparisons between top-down and bottom-up parsing are often either too high-level or too low-level. Overly high-level treatments reduce the two approaches to buzzwords, and the comparison to a recitation of received wisdom. Overly low-level treatments get immersed in the minutiae of implementation, and the resulting comparison is as revealing as placing two abstractly related code listings side by side. In this post I hope to find the middle level; to shed light on why advocates of bottom-up and top-down parsing approaches take the positions they do; and to speculate about the way forward.

Top-down parsing

The basic idea of top-down parsing is as brutally simple as anything in programming: Starting at the top, we add pieces. We do this by looking at the next token and deciding then and there where it fits into the parse tree. Once we've looked at every token, we have our parse tree.

In its purest form, this idea is too simple for practical parsing, so top-down parsing is almost always combined with lookahead. Lookahead of one token helps a lot. Longer lookaheads are very sparsely used. They just aren't that helpful, and since the number of possible lookaheads grows exponentially, they get very expensive very fast.

Top-down parsing has an issue with left recursion. It's straightforward to see why. Take an open-ended expression like

    a + b + c + d + e + f + [....]

Here the plus signs continue off to the right, and adding any of them to the parse tree requires a dedicated node which must be above the node for the first plus sign. We cannot put that first plus sign into a top-down parse tree without having first dealt with all those plus signs that follow it. For a top-down strategy, this is a big, big problem.

Even in the simplest expression, there is no way of counting the plus signs without looking to the right, quite possibly a very long way to the right. When we are not dealing with simple expressions, this rightward-looking needs to get sophisticated. There are ways of dealing with this difficulty, but all of them share one thing in common -- they are trying to make top-down parsing into something that it is not.

Advantages of top-down parsing

Top-down parsing does not look at the right context in any systematic way, and in the 1970's it was hard to believe that top-down was as good as we can do. (It's not all that easy to believe today.) But its extreme simplicity is also top-down parsing's great strength. Because a top-down parser is extremely simple, it is very easy to figure out what it is doing. And easy to figure out means easy to customize.

Take another of the many constructs incomprehensible to a top-down parser:

    2 * 3 * 4 + 5 * 6
    

How do top-down parsers typically handle this? Simple: as soon as they realize they are faced with an expression, they give up on top-down parsing and switch to a special-purpose algorithm.

These two properties -- easy to understand and easy to customize -- have catapulted top-down parsing to the top of the heap. Behind their different presentations, combinator parsing, PEG, and recursive descent are all top-down parsers.

Bottom-up parsing

Few theoreticians of the 1970's imagined that top-down parsing might be the end of the parsing story. Looking to the right in ad hoc ways clearly does help. It would be almost paradoxical if there was no systematic way to exploit the right context.

In 1965, Don Knuth found an algorithm to exploit right context. Knuth's LR algorithm was, like top-down parsing as I have described it, deterministic. Determinism was thought to be essential -- allowing more than one choice easily leads to a combinatorial explosion in the number of possibilities that have to be considered at once. When parsers are restricted to dealing with a single choice, it is much easier to guarantee that they will run in linear time.

Knuth's algorithm did not try to hang each token from a branch of a top-down parse tree as soon as it was encountered. Instead, Knuth suggested delaying that decision. Knuth's algorithm collected "subparses".

When I say "subparses" in this discussion, I mean pieces of the parse that contain all the decisions necessary to construct the part of the parse tree that is below them. But subparses do not contain any decisions about what is above them in the parse tree. Put another way, subparses know who they are, but not where they belong.

Subparses may not know where they belong, but knowing who they are is enough for them to be assembled into larger subparses. And, if we keep assembling the subparses, eventually we will have a "subparse" that is the full parse tree. And at that point we will know both who everyone is and where everyone belongs.

Knuth's algorithm stored subparses by shifting them onto a stack. The operation to do this was called a "shift". (Single tokens of the input are treated as subparses with a single node.) When there was enough context to build a larger subparse, the algorithm popped one or more subparses off the stack, assembled a larger subparse, and put the resulting subparse back on the stack. This operation was called a "reduce", based on the idea that its repeated application eventually "reduces" the parse tree to its root node.

In handling the stack, we will often be faced with choices. One kind of choice is between using what we already have on top of the stack to assemble a larger subparse; or pushing more subparses on top of the stack instead ("shift/reduce"). When we decide to reduce, we may encounter the other kind of choice -- we have to decide which rule to use ("reduce/reduce").

Like top-down parsing, bottom-up parsing is usually combined with lookahead. For the same lookahead, a bottom-up parser parses everything that a top-down parser can handle, and more.

Formally, Knuth's approach is now called shift/reduce parsing. I want to demonstrate why theoreticians, and for a long time almost everybody else as well, were so taken with this method. I'll describe how it works on some examples, including two very important ones that stump top-down parsers: arithmetic expressions and left recursion. My purpose here is to bring to light the basic concepts, and not to guide an implementor. There are excellent implementation-oriented presentations in many other places. The Wikipedia article, for example, is excellent.

Bottom-up parsing solved the problem of left recursion. In the example from above,

    a + b + c + d + e + f + [....]

we simply build one subparse after another, as rapidly as we can. In the terminology of shift/reduce, whenever we can reduce, we do. Eventually we will have run out of tokens, and will have reduced until there is only one element on the stack. That one remaining element is the subparse that is also, in fact, our full parse tree.

The top-down parser had a problem with left recursion precisely because it needed to build top-down. To build top-down, it needed to know about all the plus signs to come, because these needed to be fitted into the parse tree above the current plus sign. But when building bottom-up, we don't need to know anything about the plus signs that will be above the current one in the parse tree. We can afford to wait until we encounter them.

But if working bottom-up solves the left recursion problem, doesn't it create a right recursion problem? In fact, for a bottom-up parser, right recursion is harder, but not much. That's because of the stack. For a right recursion like this:

    a = b = c = d = e = f = [....]

we use a strategy opposite to the one we used for the left recursion. For left recursion, we reduced whenever we could. For right recursion, when we have a choice, we always shift. This means we will immediately shift the entire input onto the stack. Once the entire input is on the stack, we have no choice but to start reducing. Eventually we will reduce the stack to a single element. At that point, we are done. Essentially, what we are doing is exactly what we did for left recursion, except that we use the stack to reverse the order.

Arithmetic expressions like

    2 * 3 * 4 + 5 * 6

require a mixed strategy. Whenever we have a shift/reduce choice, and one of the operators is on the stack, we check to see if the topmost operator is a multiply or an addition operator. If it is a multiply operator, we reduce. In all other cases, if there is a shift/reduce choice, we shift.
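
Here is a small sketch of that stack discipline in code. It is illustrative only -- a real shift/reduce parser drives these decisions from tables, not from hand-written conditions -- but it follows exactly the strategy just described:

    use strict;
    use warnings;

    my @tokens = qw(2 * 3 * 4 + 5 * 6);
    my @stack;

    sub reduce_top {
        # pop "operand operator operand" and push one larger subparse
        my ($rhs, $op, $lhs) = (pop @stack, pop @stack, pop @stack);
        push @stack, "($lhs $op $rhs)";
    }

    for my $token (@tokens) {
        if ($token eq '+' or $token eq '*') {
            # shift/reduce choice: reduce while the topmost operator is '*'
            reduce_top() while @stack >= 3 && $stack[-2] eq '*';
        }
        push @stack, $token;            # shift the token onto the stack
    }
    reduce_top() while @stack >= 3;     # end of input: reduce what is left
    print "@stack\n";                   # prints (((2 * 3) * 4) + (5 * 6))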

In the discussion above, I have pulled the strategy for making stack decisions (shift/reduce and reduce/reduce) out of thin air. Clearly, if bottom-up parsing was going to be a practical parsing algorithm, the stack decisions would have to be made algorithmically. In fact, discovering a practical way to do this was a far from trivial task. The solution in Knuth's paper was considered (and apparently intended) to be mathematically provocative, rather than practical. But by 1979, it was thought a practical way to make stack decisions had been found and yacc, a parser generator based on bottom-up parsing, was released. (Readers today may be more familiar with yacc's successor, bison.)

The fate of bottom-up parsing

With yacc, it looked as if the limitations of top-down parsing were past us. We now had a parsing algorithm that could readily and directly parse left and right recursions, as well as arithmetic expressions. Theoreticians thought they'd found the Holy Grail.

But not all of the medieval romances had happy endings. And as I've described elsewhere, this story ended badly. Bottom-up parsing was driven by tables which made the algorithm fast for correct inputs, but unable to accurately diagnose faulty ones. The subset of grammars parsed was still not quite large enough, even for conservative language designers. And bottom-up parsing was very unfriendly to custom hacks, which made every shortcoming loom large. It is much harder to work around a problem in a bottom-up parser than it is to deal with a similar shortcoming in a top-down parser. After decades of experience with bottom-up parsing, top-down parsing has re-emerged as the algorithm of choice.

Non-determinism

For many, the return to top-down parsing answers the question that we posed earlier: "Is there any systematic way to exploit right context when parsing?" So far, the answer seems to be a rather startling "No". Can this really be the end of the story?

It would be very strange if the best basic parsing algorithm we know is top-down. Above, I described at some length some very important grammars that can be parsed bottom-up but not top-down, at least not directly. Progress like this seems like a lot to walk away from, and especially to walk back all the way to what is essentially a brute force algorithm. This perhaps explains why lectures and textbooks persist in teaching bottom-up parsing to students who are very unlikely to use it. Because the verdict from practitioners seems to be in, and likely to hold up on appeal.

Fans of deterministic top-down parsing, and proponents of deterministic bottom-up parsing share an assumption: For a practical algorithm to be linear, it has to be deterministic. But is this actually the case?

It's not, in fact. To keep bottom-up parsing deterministic, we restricted ourselves to a stack. But what if we track all possible subpieces of parses? For efficiency, we can link them and put them into tables, making the final decisions in a second pass, once the tables are complete. (The second pass replaces the stack-driven see-sawing back and forth of the deterministic bottom-up algorithm, so it's not an inefficiency.) Jay Earley in 1968 came up with an algorithm to do this, and in 1991 Joop Leo added a memoization to Earley's algorithm which made it linear for all deterministic grammars.

The "deterministic grammars" are exactly the bottom-up parseable grammars with lookahead -- the set of grammars parsed by Knuth's algorithm. So that means the Earley/Leo algorithm parses, in linear time, everything that a deterministic bottom-up parser can parse, and therefore every grammar that a deterministic top-down parser can parse. (In fact, the Earley/Leo algorithm is linear for a lot of ambiguous grammars as well.)

Top-down parsing had the advantage that it was easy to know where you are. The Earley/Leo algorithm has an equivalent advantage -- its tables know where it is, and it is easy to query them programmatically. In 2010, this blogger modified the Earley/Leo algorithm to have the other big advantage of top-down parsing: The Marpa algorithm rearranges the Earley/Leo parse engine so that we can stop it, perform our own logic, and restart where we left off. A quite useable parser based on the Marpa algorithm is available as open source.

Comments

Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net. To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site.

Perlgeek.de : YAPC Europe 2013 Day 3

The second day of YAPC Europe climaxed in the river boat cruise, Kiev's version of the traditional conference dinner. It was a largish boat traveling on the Dnipro river, with food, drinks and lots of Perl folks. Not having fixed tables, and having to get up to fetch food and drinks, led to a lot of circulation, and thus to meeting many more people than at traditional dinners. I loved it.

Day 3 started with a video message from next year's YAPC Europe organizers, advertising the upcoming conference and talking a bit about the opportunities that Sofia offers. Tempting :-).

Monitoring with Perl and Unix::Statgrab was more about the metrics that are available for monitoring, and less about doing stuff with Perl. I was a bit disappointed.

The "Future Perl Versioning" Discussion was a very civilized discussion, with solid arguments. Whether anybody changed their minds remain to be seen.

Carl Mäsak gave two great talks: one on reactive programming, and one on regular expressions. I learned quite a bit in the first one, and simply enjoyed the second one.

After the lunch (tasty again), I attended Jonathan Worthington's third talk, MoarVM: a metamodel-focused runtime for NQP and Rakudo. Again this was a great talk, based on great work done by Jonathan and others during the last 12 months or so. MoarVM is a virtual machine designed for Perl 6's needs, as we understand them now (as opposed to parrot, which was designed towards Perl 6 as it was understood around 2003 or so, which is considerably different).

How to speak manager was both amusing and offered a nice perspective on interactions between managers and programmers. Some of this advice assumed a non-tech-savvy manager, and thus didn't quite apply to my current work situation, but was still interesting.

I must confess I don't remember too much of the rest of the talks that evening. I blame five days of traveling, hackathon and conference taking their toll on me.

The third session of lightning talks was again an interesting mix, containing interesting technical tidbits, the usual "we are hiring" slogans, some touching and thoughtful moments, and finally a song by Piers Cawley. He had written the lyrics in the previous 18 hours (including sleep), to (afaict) a traditional Irish song. Standing up in front of ~300 people and singing a song that you haven't really had time to practise takes a huge amount of courage, and I admire Piers both for his courage and his great performance. I hope it was recorded, and makes its way to the public soon.

Finally the organizers spoke some closing words, and received their well-deserved share of applause.

As you might have guessed from this and the previous blog posts, I enjoyed this year's YAPC Europe very much, and found it well worth attending, and well organized. I'd like to give my heart-felt thanks to everybody who helped to make it happen, and to my employer for sending me there.

This being only my second YAPC, I can't make any far-reaching comparisons, but compared to YAPC::EU 2010 in Pisa I had an easier time making acquaintances. I cannot tell what the big difference was, but the buffet-style dinners at the pre-conference meeting and the river boat cruise certainly helped to increase the circulation and thus the number of people I talked to.

Dave's Free Press: Journal: YAPC::Europe 2007 travel plans

Perlgeek.de : A small regex optimization for NQP and Rakudo

Recently I read the course material of the Rakudo and NQP Internals Workshop, and had an idea for a small optimization for the regex engine. Yesterday night I implemented it, and I'd like to walk you through the process.

As a bit of background, the regex engine that Rakudo uses is actually implemented in NQP, and used by NQP too. The code I am about to discuss all lives in the NQP repository, but Rakudo profits from it too.

In addition one should note that the regex engine is mostly used for parsing with grammars, a process which involves nearly no scanning. Scanning is the process where the regex engine first tries to match the regex at the start of the string, and if it fails there, moves to the second character in the string, tries again etc. until it succeeds.

But regexes that users write often involve scanning, and so my idea was to speed up regexes that scan, and where the first thing in the regex is a literal. In this case it makes sense to find possible start positions with a fast string search algorithm, for example the Boyer-Moore algorithm. The virtual machine backends for NQP already implement that as the index opcode, which can be invoked as start = index haystack, needle, startpos, where the string haystack is searched for the substring needle, starting from position startpos.
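
To illustrate the idea, here is a sketch in ordinary Perl 5, using its built-in index (the real engine works in terms of the VM's index opcode and the regex engine's own match attempts, not this code; the strings are made up):

my $haystack = 'the quick brown fox';
my $literal  = 'brown';          # literal prefix of the regex

my $pos = 0;
while (($pos = index($haystack, $literal, $pos)) >= 0) {
    # $pos is a candidate start position; the engine would now attempt
    # the full regex match here, and continue scanning on failure
    print "candidate start position: $pos\n";
    $pos++;
}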

From reading the course material I knew I had to search for a regex type called scan, so that's what I did:

$ git grep --word scan
3rdparty/libtommath/bn_error.c:   /* scan the lookup table for the given message
3rdparty/libtommath/bn_mp_cnt_lsb.c:   /* scan lower digits until non-zero */
3rdparty/libtommath/bn_mp_cnt_lsb.c:   /* now scan this digit until a 1 is found
3rdparty/libtommath/bn_mp_prime_next_prime.c:                   /* scan upwards 
3rdparty/libtommath/changes.txt:       -- Started the Depends framework, wrote d
src/QRegex/P5Regex/Actions.nqp:                     QAST::Regex.new( :rxtype<sca
src/QRegex/P6Regex/Actions.nqp:                     QAST::Regex.new( :rxtype<sca
src/vm/jvm/QAST/Compiler.nqp:    method scan($node) {
src/vm/moar/QAST/QASTRegexCompilerMAST.nqp:    method scan($node) {
Binary file src/vm/moar/stage0/NQPP6QRegexMoar.moarvm matches
Binary file src/vm/moar/stage0/QASTMoar.moarvm matches
src/vm/parrot/QAST/Compiler.nqp:    method scan($node) {
src/vm/parrot/stage0/P6QRegex-s0.pir:    $P5025 = $P5024."new"("scan" :named("rx
src/vm/parrot/stage0/QAST-s0.pir:.sub "scan" :subid("cuid_135_1381944260.6802") 
src/vm/parrot/stage0/QAST-s0.pir:    push $P5004, "scan"

The binary files and .pir files are generated code included just for bootstrapping, and not interesting for us. The files in 3rdparty/libtommath are there for bigint handling, thus not interesting for us either. The rest are good matches: src/QRegex/P6Regex/Actions.nqp is responsible for compiling Perl 6 regexes to an abstract syntax tree (AST), and src/vm/parrot/QAST/Compiler.nqp compiles that AST down to PIR, the assembly language that the Parrot Virtual Machine understands.

So, looking at src/QRegex/P6Regex/Actions.nqp the place that mentions scan looked like this:

    $block<orig_qast> := $qast;
    $qast := QAST::Regex.new( :rxtype<concat>,
                 QAST::Regex.new( :rxtype<scan> ),
                 $qast,
                 ($anon
                      ?? QAST::Regex.new( :rxtype<pass> )
                      !! (nqp::substr(%*RX<name>, 0, 12) ne '!!LATENAME!!'
                            ?? QAST::Regex.new( :rxtype<pass>, :name(%*RX<name>) )
                            !! QAST::Regex.new( :rxtype<pass>,
                                   QAST::Var.new(
                                       :name(nqp::substr(%*RX<name>, 12)),
                                       :scope('lexical')
                                   ) 
                               )
                          )));

So to make the regex scan, the AST (in $qast) is wrapped in QAST::Regex.new(:rxtype<concat>,QAST::Regex.new( :rxtype<scan> ), $qast, ...), plus some stuff I don't care about.

To make the optimization work, the scan node needs to know what to scan for, provided the first thing in the regex is indeed a constant string, a.k.a. a literal. If it is, $qast is either directly of rxtype literal, or a concat node whose first child is a literal. As a patch, it looks like this:

--- a/src/QRegex/P6Regex/Actions.nqp
+++ b/src/QRegex/P6Regex/Actions.nqp
@@ -667,9 +667,21 @@ class QRegex::P6Regex::Actions is HLL::Actions {
     self.store_regex_nfa($code_obj, $block, QRegex::NFA.new.addnode($qast))
     self.alt_nfas($code_obj, $block, $qast);
 
+    my $scan := QAST::Regex.new( :rxtype<scan> );
+    {
+        my $q := $qast;
+        if $q.rxtype eq 'concat' && $q[0] {
+            $q := $q[0]
+        }
+        if $q.rxtype eq 'literal' {
+            nqp::push($scan, $q[0]);
+            $scan.subtype($q.subtype);
+        }
+    }
+
     $block<orig_qast> := $qast;
     $qast := QAST::Regex.new( :rxtype<concat>,
-                 QAST::Regex.new( :rxtype<scan> ),
+                 $scan,
                  $qast,

Since scan nodes have always been empty so far, the code generators don't look at their child nodes, and adding one with nqp::push($scan, $q[0]); won't break anything on backends that don't support this optimization yet (which, after just this patch, were all of them). Running make test confirmed that.

My original patch did not contain the line $scan.subtype($q.subtype);, and later on some unit tests started to fail, because regex matches can be case insensitive, but the index op only works case-sensitively. For case insensitive matches, the $q.subtype of the literal regex node would be ignorecase, so that information needs to be carried on to the code generation backend.

Once that part was in place, and some debug nqp::say() statements confirmed that it indeed worked, it was time to look at the code generation. For the parrot backend, it looked like this:

    method scan($node) {
        my $ops := self.post_new('Ops', :result(%*REG<cur>));
        my $prefix := self.unique('rxscan');
        my $looplabel := self.post_new('Label', :name($prefix ~ '_loop'));
        my $scanlabel := self.post_new('Label', :name($prefix ~ '_scan'));
        my $donelabel := self.post_new('Label', :name($prefix ~ '_done'));
        $ops.push_pirop('repr_get_attr_int', '$I11', 'self', %*REG<curclass>, '"$!from"');
        $ops.push_pirop('ne', '$I11', -1, $donelabel);
        $ops.push_pirop('goto', $scanlabel);
        $ops.push($looplabel);
        $ops.push_pirop('inc', %*REG<pos>);
        $ops.push_pirop('gt', %*REG<pos>, %*REG<eos>, %*REG<fail>);
        $ops.push_pirop('repr_bind_attr_int', %*REG<cur>, %*REG<curclass>, '"$!from"', %*REG<pos>);
        $ops.push($scanlabel);
        self.regex_mark($ops, $looplabel, %*REG<pos>, 0);
        $ops.push($donelabel);
        $ops;
    }

While a bit intimidating at first, staring at it for a while quickly made clear what kind of code it emits. First, three labels are generated to which the code can jump with goto $label: one as a jump target for the loop that increments the cursor position ($looplabel), one for doing the regex match at that position ($scanlabel), and $donelabel to jump to when the whole thing has finished.

Inside the loop there is an increment (inc) of the register that holds the current position (%*REG<pos>), that position is compared to the end-of-string position (%*REG<eos>), and if it is larger, the cursor is marked as failed.

So the idea is to advance the position by one, and then instead of doing the regex match immediately, call the index op to find the next position where the regex might succeed:

--- a/src/vm/parrot/QAST/Compiler.nqp
+++ b/src/vm/parrot/QAST/Compiler.nqp
@@ -1564,7 +1564,13 @@ class QAST::Compiler is HLL::Compiler {
         $ops.push_pirop('goto', $scanlabel);
         $ops.push($looplabel);
         $ops.push_pirop('inc', %*REG<pos>);
-        $ops.push_pirop('gt', %*REG<pos>, %*REG<eos>, %*REG<fail>);
+        if nqp::elems($node.list) && $node.subtype ne 'ignorecase' {
+            $ops.push_pirop('index', %*REG<pos>, %*REG<tgt>, self.rxescape($node[0]), %*REG<pos>);
+            $ops.push_pirop('eq', %*REG<pos>, -1, %*REG<fail>);
+        }
+        else {
+            $ops.push_pirop('gt', %*REG<pos>, %*REG<eos>, %*REG<fail>);
+        }
         $ops.push_pirop('repr_bind_attr_int', %*REG<cur>, %*REG<curclass>, '"$!from"', %*REG<pos>);
         $ops.push($scanlabel);
         self.regex_mark($ops, $looplabel, %*REG<pos>, 0);

The index op returns -1 on failure, so the condition for a cursor fail is slightly different than before.

And as mentioned earlier, the optimization can only be done safely for matches that don't ignore case. Maybe with some additional effort that could be remedied, but it's not as simple as case-folding the target string, because some case-folding operations can change the string length (for example, ß becomes SS when uppercased).
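
A quick illustration of that pitfall (not from the original post): uppercasing can change the number of characters, so positions found in a case-folded copy would no longer line up with positions in the original string:

    # case folding can change the length of a string
    my $s = 'Straße';
    say $s.chars;      # 6
    say $s.uc;         # STRASSE
    say $s.uc.chars;   # 7 -- one character longer than the original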

After successfully testing the patch, I came up with a small, artificial benchmark designed to show a difference in performance for this particular case. And indeed, it sped it up from 647 ± 28 µs to 161 ± 18 µs, which is roughly a factor of four.
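
The benchmark itself isn't reproduced in the post, but the kind of case it targets looks roughly like this (an illustrative sketch, not the author's code): a literal-leading regex matched against a long string where the literal only occurs near the end:

    # a literal-leading regex that forces the engine to scan far into the string
    my $haystack = 'a' x 100_000 ~ 'needle in the haystack';
    my $start    = now;
    my $matches  = 0;
    for ^100 {
        $matches++ if $haystack ~~ /needle/;
    }
    say "$matches matches took {now - $start} seconds";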

You can see the whole thing as two commits on github.

What remains to do is implementing the same optimization on the JVM and MoarVM backends, and of course other optimizations. For example the Perl 5 regex engine keeps track of minimal and maximal string lengths for each subregex, and can anchor a regex like /a?b?longliteral/ to 0..2 characters before a match of longliteral, and generally use that meta information to fail faster.

But for now I am mostly encouraged that doing a worthwhile optimization was possible in a single evening without any black magic, or too intimate knowledge of the code generation.

Update: the code generation for MoarVM now also uses the index op. The logic is the same as for the parrot backend, the only difference is that the literal needs to be loaded into a register (whose name fresh_s returns) before index_s can use it.

Dave's Free Press: Journal: Wikipedia handheld proxy

Dave's Free Press: Journal: Bryar security hole

Dave's Free Press: Journal: POD includes

Perlgeek.de : First day at YAPC::Europe 2013 in Kiev

Today was the first "real" day of YAPC Europe 2013 in Kiev. In the same sense that it was the first real day, we had quite a nice "unreal" conference day yesterday, with a day-long Perl 6 hackathon, and in the evening a pre-conference meeting at a Soviet-style restaurant with tasty food and beverages.

The talks started with a few words of welcome, and then the announcement that the YAPC Europe next year will be in Sofia, Bulgaria, with the small side note that there were actually three cities competing for that honour. Congratulations to Sofia!

Larry's traditional keynote was quite emotional, and he had to fight tears a few times. Having had cancer and related surgeries in the past year, he still does his perceived duty to the Perl community, which I greatly appreciate.

Afterwards Dave Cross talked about 25 years of Perl in 25 minutes, which was a nice walk through some significant developments in the Perl world, though a bit hasty. Maybe picking fewer events and spending a bit more time on the selected few would give a smoother experience.

Another excellent talk that ran out of time was on Redis. Having experimented a wee bit with Redis in the past month, this was a real eye-opener on the wealth of features we might have used for a project at work, but in the end we didn't. Maybe we will eventually revise that decision.

Ribasushi talked about how hard benchmarking really is, and while I was (in principle) aware of the fact that it's hard to get right, there were still several significant factors that I had overlooked (like the CPU's tendency to scale frequency in response to thermal and power-management considerations). I also learned that I should use Dumbbench instead of the Benchmark.pm core module. Sadly it didn't install for me (Capture::Tiny tests failing on Mac OS X).

The Perl 6 is dead, long live Perl 5 talk was much less inflammatory than the title would suggest (maybe due to Larry touching on the subject briefly during the keynote). It was mostly about how Perl 5 is used in the presenter's company, which was mildly interesting.

After a tasty free lunch I attended jnthn's talk on Rakudo on the JVM, which was (as is typical for jnthn's talks) both entertaining and instructive, even though I had already followed the project quite a bit.

Thomas Klausner's Bread::Board by example made me want to refactor the OTRS internals very badly, because it is full of the anti-patterns that Bread::Board can solve in a much better way. I think that the OTRS code base is big enough to warrant the usage of Bread::Board.

I enjoyed Denis' talk on Method::Signatures, and was delighted to see that most syntax is directly copied from Perl 6 signature syntax. Talk about Perl 6 sucking creativity out of Perl 5 development.

The conference ended with a session of lightning talks, something which I always enjoy. Many lightning talks had a slightly funny tone or undertone, while still talking about interesting stuff.

Finally there was the "kick-off party", beverages and snacks sponsored by booking.com. There (and really the whole day, and yesterday too) I not only had conversations with my "old" Perl 6 friends, but also talked with many interesting people I never met before, or only met online before.

So all in all it was a nice experience, both from the social side, and from quality and contents of the talks. Venue and food are good, and the wifi too, except when it stops working for a few minutes.

I'm looking forward to two more days of conference!

(Updated: Fixed Thomas' last name)

Dave's Free Press: Journal: cgit syntax highlighting

Ocean of Awareness: What makes a parsing algorithm successful?

What makes a parsing algorithm successful? Two factors, I think. First, does the algorithm parse a workably-defined set of grammars in linear time? Second, does it allow the application to intervene in the parse with custom code? When parsing algorithms are compared, typically neither of these gets much attention. But the successful algorithms do one or the other.

Does the algorithm parse a workably-defined set of grammars in linear time?

"Workably-defined" means more than well-defined in the mathematical sense -- the definition has to be workable. That is, the definition must be something that, with reasonable effort, a programmer can use in practice.

The algorithms in regular expression engines are workably-defined. A regular expression, in the pure sense, consists of a sequence of symbols, usually shown by concatenation:

a b c

or a choice among sequences, usually shown by a vertical bar:

a | b | c

or a repetition of any of the above, typically shown with a star:

a*

or any recursive combination of these. True, if this definition is new to you, it can take time to get used to. But vast numbers of working programmers are very much "used to it", can think in terms of regular expressions, and can determine whether or not a particular problem will yield to treatment as a regular expression.
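
In Perl 6 regex notation, for example (an illustration added here, not part of the original post), those three building blocks look like this:

    # concatenation, alternation and repetition in Perl 6 regex syntax
    say so 'abc'  ~~ / a b c /;        # True -- concatenation
    say so 'b'    ~~ / a | b | c /;    # True -- alternation
    say so 'aaaa' ~~ / ^ a* $ /;       # True -- repetition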

Parsers in the LALR family (yacc, bison, etc.) do not have a workably defined set of grammars. LALR is perfectly well-defined mathematically, but even experts in parsing theory are hard put to decide if a particular grammar is LALR.

Recursive descent also does not have a workably defined set of grammars. Recursive descent doesn't even have a precise mathematical description -- you can say that recursive descent is LL, but in practice LL tables are rarely used. Also in practice, the LL logic is extended with every other trick imaginable, up to and including switching to other parsing algorithms.

Does it allow the user to intervene in the parse?

It is not easy for users to intervene in the processing of a regular expression, though some implementations attempt to allow such efforts. LALR parsers are notoriously opaque. Those who maintain the LALR-driven Perl parser have tried to supplement its abilities with custom code, with results that will not encourage others making the same attempt.

Recursive descent, on the other hand, has no parse engine -- it is 100% custom code. You don't get much friendlier than that.

Conclusions

Regular expressions are a success, and will remain so, because the set of grammars they handle is very workably-defined. Applications using regular expressions have to take what the algorithm gives them, but what it gives them is very predictable.

For example, an application can write regular expressions on the fly, and the programmer can be confident they will run as long as they are well-formed. And it is easy to determine if the regular expression is well-formed. (Whether it actually does what you want is a separate issue.)

Recursive descent does not handle a workably-defined set of grammars, and it also has to be hand-written. But it makes up for this by allowing the user to step into the parsing process anywhere, and "get his hands dirty". Recursive descent does nothing for you, but it does allow you complete control. This is enough to make recursive descent the current algorithm of choice for major parsing projects.

As I have chronicled elsewhere, LALR was once, on highly convincing theoretical grounds, seen as the solution to the parsing problem. But while mathematically well-defined, LALR was not workably defined. And it was very hostile to applications that tried to alter, or even examine, its syntax-driven workings. After decades of trying to make it work, the profession has abandoned LALR almost totally.

What about Marpa?

Marpa has both properties: its set of grammars is workably-defined. And, while Marpa is syntax-driven like LALR and regular expressions, it also allows the user to stop the parse engine, communicate with it about the state of the parse, do her own parsing for a while, and restart the parse engine at any point she wants.

Marpa's workable definition has a nuance that the one for regular expressions does not. For regular expressions, linearity is a given -- they parse in linear time or fail. Marpa parses a much larger class of grammars, the context-free grammars -- anything that can be written in BNF. BNF is used to describe languages in standards, and is therefore itself a kind of "gold standard" for a workable definition of a set of grammars. However, Marpa does not parse everything that can be written in BNF in linear time.

Marpa's linearly-parsed set of grammars is smaller than the context-free grammars, but it is still very large, and it is still workably-defined. Marpa will parse any unambiguous language in linear time, unless it contains unmarked middle recursions. An example of a "marked" middle recursion is the language described by

S ::= a S a | x

a string of which is "aaaxaaa", where the "x" marks the middle. An example of an "unmarked" middle recursion is the language described by

S ::= a S a | a

a string of which is "aaaaaaa", where nothing marks the middle, so that you don't know until the end where the middle of the recursion is. If a human can reliably find the middle by eyeball, the middle recursion is marked. If a human can't, then the middle recursion might be unmarked.

Marpa also parses a large set of ambiguous grammars linearly, and this set of grammars is also workably-defined. Marpa parses an ambiguous grammar in linear time if

  • It has no unmarked middle recursions.
  • All right recursions are unambiguous.
  • There are no cycles. A cycle occurs, for example, if there is a rule A ::= A in the grammar.
  • Marpa's level of ambiguity at any location is bounded by a constant.

The term "level of ambiguity" requires a bit of explanation. At any given location, there can be as many rules "in play" as you like, without affecting the level of ambiguity. The key question: What is the maximum number of different origins that a rule might have? (The "origin" of a rule is the location where it begins.) That is, can a rule currently in play have at most 20 different origins? Or could it have its origin at every location so far? If the maximum number of origins is 20 or any other fixed constant, the level of ambiguity is "bounded". But if the maximum number of origins keeps growing as the length of the input grows, the level of ambiguity is unbounded.

For the unambiguous case, Marpa's workable definition encompasses a much larger class of grammars, but is no more complex than that for regular expressions. If you want to extend even further, and work with ambiguous grammars, the definition remains quite workable. Of the four restrictions needed to ensure linearity, the one requiring a bounded level of ambiguity is the only one that might force you to exercise real vigilance -- once you get into ambiguity, unboundedness is easy to slip into.

As for the other three, cycles never occur in practical grammars, and Marpa reports them, so that you simply fix them when they happen. Most recursions will be left recursions, which are unrestricted. My experience has been that, in practical grammars, unmarked middle recursions and ambiguous right recursions are not especially tempting features. If you note whenever you use a right recursion, checking that it is not ambiguous, and if you note whenever you use a middle recursion, checking that it is marked, then you will stay linear.

To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site.

Comments

Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: CPAN Testers' CPAN author FAQ

Perlgeek.de : Why is it hard to write a compiler for Perl 6?

Today's deceptively simple question on #perl6: is it harder to write a compiler for Perl 6 than for any other programming language?

The answer is simple: yes, it's harder (and more work) than for many other languages. The more involved question is: why?

So, let's take a look. The first point is organizational: Perl 6 isn't yet fully explored and formally specified; it's much more stable than it used to be, but less stable than, say, targeting C89.

But even if you disregard this point, and target the subset that, for example, the Rakudo Perl 6 compiler implements right now, or wait a year and target the first Perl 6 language release, the point remains valid.

So let's look at some technical aspects.

Static vs. Dynamic

Perl 6 has both static and dynamic corners. For example, lexical lookups are static, in the sense that they can be resolved at compile time. And that's not optional: for a compiler to properly support native types, it must resolve them at compile time. We also expect the compiler to notify us of certain errors at compile time, so there must be a fair amount of static analysis.

On the other hand, type annotations are optional pretty much anywhere, and methods are late bound. So the compiler must also support features typically found in dynamic languages.

And even though method calls are late bound, composing roles into classes is a compile time operation, with mandatory compile time analysis.
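
For a feel of that mix, here is a small illustrative example (not from the original post): the native type below has to be resolved statically, while the method call on the untyped variable is dispatched at run time:

    # the native int type must be resolved by the compiler ...
    my int $count = 42;

    # ... while this method call is late bound: which greet() runs
    # depends on the object $thing happens to hold at run time
    class Dog { method greet() { 'woof' } }
    class Cat { method greet() { 'meow' } }
    my $thing = (Dog, Cat).pick.new;
    say $thing.greet();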

Mutable grammar

The Perl 6 grammar can change during a parse, for example by newly defined operators, but also through more invasive operations such as defining slangs or macros. Speaking of slangs: Perl 6 doesn't have a single grammar, it switches back and forth between the "main" language, regexes, character classes inside regexes, quotes, and all the other dialects you might think of.

Since the grammar extensions are done with, well, Perl 6 grammars, it forces the parser to be interoperable with Perl 6 regexes and grammars. At which point you might just as well use them for parsing the whole thing, and you get some level of minimally required self-hosting.
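
As a minimal illustration (not from the original post): declaring a new operator changes the grammar for the rest of the compilation unit, so the parser has to pick up the change while it is still parsing the very same file:

    # after this declaration the parser accepts ± as an infix operator
    sub infix:<±>($a, $b) { ($a - $b, $a + $b) }

    say 10 ± 2;    # (8 12)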

Meta-Object Programming

In a language like C++, the behavior of the object system is hard-coded into the language, and so the compiler can work under this assumption, and optimize the heck out of it.

In Perl 6, the object system is defined by other objects and classes, the meta objects. So there is another layer of indirection that must be handled.
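
A small taste of that indirection, loosely following the Rakudo documentation's meta-object examples (illustrative, not from the original post): the .^ syntax forwards to the meta object, and the same meta-object protocol can build a class entirely at run time:

    # .^ forwards to the class's meta object (its HOW)
    class Point { has $.x; has $.y }
    say Point.^name;                    # Point
    say Point.^attributes.map(*.name);  # ($!x $!y)

    # the same meta-object protocol can build a class at run time
    my $type := Metamodel::ClassHOW.new_type(name => 'Dynamic');
    $type.^add_method('hi', my method hi(Mu:) { 'hello from a runtime-built class' });
    $type.^compose;
    say $type.hi;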

Mixing of compilation and run time

Declarations like classes, but also BEGIN blocks and the right-hand sides of constant declarations, are run as soon as they are parsed. Which means the compiler must be able to run Perl 6 code while compiling Perl 6 code. And also the other way round, through EVAL.

More importantly, it must be able to run Perl 6 code before it has finished compiling the whole compilation unit. That means it hasn't even fully constructed the lexical pads, and hasn't initialized all the variables. So it needs special "static lexpads" to which compile-time usages of variables can fall back. Also, the object system has to be able to work with types that haven't been fully declared yet.
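
A tiny illustrative example (not from the original post) of code running while the compilation unit is still being compiled:

    # both of these run as soon as the compiler reaches them,
    # long before the program as a whole starts executing
    BEGIN say 'this prints at compile time';
    constant answer = 6 * 7;    # right-hand side evaluated at compile time

    say 'this prints at run time';
    say answer;                 # 42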

So, lots of trickiness involved.

Serialization, Repossession

Types are objects defined through their meta objects. That means that when you precompile a module (or even just the setting, that is, the mass of built-ins), the compiler has to serialize the types and their meta objects. Including closures. Do you have any idea how hard it is to correctly serialize closures?

But, classes are mutable. So another module might load a precompiled module, and add another method to it, or otherwise mess with it. Now the compiler has to serialize the fact that, if the second module is loaded, the object from the first module is modified. We say that the serialization context from the second module repossesses the type.
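
At the language level, the kind of mutation that triggers repossession looks roughly like this (an illustrative sketch; in the real scenario Greeter would live in a separately precompiled module):

    use MONKEY-TYPING;

    class Greeter { method hi() { 'hi' } }   # imagine this came from a precompiled module

    # a later module adding a method to the already compiled class; across
    # precompilation boundaries this is what forces the loading module's
    # serialization context to repossess Greeter's meta object
    augment class Greeter {
        method shout() { self.hi.uc ~ '!' }
    }

    say Greeter.new.shout;    # HI!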

And there are so many ways in which this can go wrong.

General Featuritis

One of the many Perl 6 mottos is "torture the implementor on behalf of the user". So it demands not only both static and dynamic typing, but also functional features, continuations, exceptions, lazy lists, a powerful grammar engine, named arguments, variadic arguments, introspection of call frames, closures, lexical and dynamic variables, packed types (for direct interfacing with C libraries, for example), and phasers (code that is automatically run at different phases of the program).
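
Just to pick three of those (an illustrative sketch, not from the original post): lazy lists, closures and phasers already interact in a handful of lines:

    # a lazy infinite list, a closure and a phaser in a few lines
    my @squares = (1..Inf).map(* ** 2);     # stays lazy
    sub report($n) {
        LEAVE say "leaving report, n = $n"; # phaser runs on scope exit
        @squares[^$n].join(', ')            # the sub closes over @squares
    }
    say report(5);    # 1, 4, 9, 16, 25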

All of these features aren't too hard to implement in isolation, but in combination they are a real killer. And you want it to be fast, right?

Dave's Free Press: Journal: Thankyou, Anonymous Benefactor!

Dave's Free Press: Journal: Number::Phone release

Dave's Free Press: Journal: Ill

Dave's Free Press: Journal: CPANdeps upgrade

Dave's Free Press: Journal: YAPC::Europe 2006 report: day 3
