End Point Blog: MySQL to PostgreSQL Migration Tips

I was recently involved in a project to migrate a client's existing application from MySQL to PostgreSQL, and I wanted to record some of my experiences in the hope that they will be useful to others.

Note that this list is not exhaustive; it was taken from my notes of issues encountered and things we had to take into consideration during this migration process.

Convert the schema

The first step is to create the equivalent schema in your PostgreSQL system, generated from the original MySQL schema.

We used `mysqldump --compatible=postgresql --no-data` to get a dump which matched PostgreSQL's quoting rules. This file still required some manual editing to clean up remaining issues, such as removing MySQL's ENGINE specification from the end of each CREATE TABLE statement, but the result was a script with which we were able to create a skeleton PostgreSQL database with the correct database objects, names, types, etc.

Some of the considerations here include the database collations/charsets. MySQL supports multiple collations/charsets per database; in this case everything was already stored as UTF-8, which matched the encoding of the PostgreSQL database, so no additional changes were needed here. Otherwise, it would have been necessary to note the original encoding of the individual tables and convert the data to UTF-8 in the next step.

We needed to make the following modifications for datatypes:

MySQL Datatype      PostgreSQL Datatype
tinyint             int
int(NN)             int
blob                text*
datetime            date
int unsigned        int
enum('1')           bool
longtext            text
varbinary(NN)       bytea

* Note: this was the right choice given the data, but not necessarily in general.

There were a few other syntactic changes as well; for example, MySQL's UNIQUE KEY clause inside a CREATE TABLE statement needs to be just UNIQUE in PostgreSQL.
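
As a quick sketch of that change (the table here is hypothetical, not one from the actual schema), a MySQL definition like:

CREATE TABLE users (
    id int NOT NULL,
    email varchar(255),
    PRIMARY KEY (id),
    UNIQUE KEY email (email)
) ENGINE=InnoDB;

becomes, after dropping the ENGINE clause and the index name:

CREATE TABLE users (
    id int NOT NULL,
    email varchar(255),
    PRIMARY KEY (id),
    UNIQUE (email)
);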

Some of the MySQL indexes were also defined as FULLTEXT indexes, a keyword PostgreSQL does not recognize. We made note of these and created normal indexes for the time being, intending to review later to what extent they actually needed full-text search capabilities.

Some of the AUTO_INCREMENT fields did not get their DEFAULT set to a sequence, because those columns ended up as plain integers rather than being declared as serial fields, so we used the following query to correct this:

-- cleanup missing autoincrement fields

WITH
datasource AS (
    SELECT
        k.table_name,
        k.column_name,
        atttypid::regtype,
        adsrc
    FROM
        information_schema.key_column_usage k
    JOIN
        pg_attribute
    ON
        attrelid = k.table_name :: regclass AND
        attname = k.column_name
    LEFT JOIN
        pg_attrdef
    ON
        adrelid = k.table_name :: regclass AND
        adnum = k.ordinal_position
    WHERE
        table_name IN (
            SELECT table_name::text FROM information_schema.key_column_usage WHERE constraint_name LIKE '%_pkey' GROUP BY table_name HAVING count(table_name) = 1
        ) AND
        adsrc IS NULL AND
        atttypid = 'integer' ::regtype
),
frags AS (
    SELECT
        quote_ident(table_name || '_' || column_name || '_seq') AS q_seqname,
        quote_ident(table_name) as q_table,
        quote_ident(column_name) as q_col
    FROM
        datasource
),
queries AS (
    SELECT
        'CREATE SEQUENCE ' || q_seqname || ';
' ||
        'ALTER TABLE ' || q_table || ' ALTER COLUMN ' || q_col || $$ SET DEFAULT nextval('$$ || q_seqname || $$');
    $$ ||
        $$SELECT setval('$$ || q_seqname || $$',(SELECT max($$ || q_col || ') FROM ' || q_table || '));
' AS query
    FROM frags
)
SELECT
    COALESCE(string_agg(query, E'\n'),$$SELECT 'No autoincrement fixes needed';$$) AS queries FROM queries
\gset

BEGIN;
:queries
COMMIT;

Basically, the idea is that we look for all tables with a single-column integer primary key (hand-waving it by relying on the _pkey suffix in the constraint name) but without a current default value, then generate the SQL to create a sequence and set that column's default to nextval() of the sequence in question. We also generate SQL to scan that table and set the sequence to the next appropriate value for the column. (Since this is a migration and we know we'll be the only user accessing these tables, we can ignore MVCC concerns.)

Another interesting thing about this script is that it uses psql's ability to store a query result in a variable via the \gset command, and then executes the generated SQL by interpolating that variable later in the same script.

Convert the data


The next step was to prepare the data load from a MySQL data-only dump. Using a recipe similar to the one for the schema dump, we ran `mysqldump --compatible=postgresql --no-create-info --extended-insert > data.sql` to save the data to a dump file so we could iteratively tweak our approach to cleaning up the MySQL data.

Using our dump file, we attempted a fresh load into the new PostgreSQL database. This failed initially due to multiple issues, including invalid character encodings and PostgreSQL's stricter datatype interpretation.

We ended up creating a filter script to handle all of the "fixup" issues. This involved decoding the data and re-encoding it to ensure we were using proper UTF-8, performing some context-sensitive datatype conversions, etc.

Additional schema modifications


As we were already using a filter script to process the data dump, we decided to take the opportunity to fix up some warts in the existing table definitions. This included some fields which were varchar but should actually have been numeric or integer; since this was a non-trivial schema (100 tables), we used PostgreSQL's system views to identify a list of columns which should be numeric but currently were not.

Since this was an ecommerce application, we identified columns that were likely candidates for datatype reassignment based on field names: *count, *qty, *price, *num.

Once we identified the fields in question, I wrote a script to generate the appropriate ALTER TABLE statements to first drop the default, then change the column type, then set the new default. This was driven by a mapping between table/column name and the desired output type.
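
The generated statements followed roughly this pattern (the table, column, and default here are hypothetical; the USING cast varied per column):

ALTER TABLE order_items ALTER COLUMN unit_price DROP DEFAULT;
ALTER TABLE order_items ALTER COLUMN unit_price TYPE numeric(10,2) USING unit_price::numeric;
ALTER TABLE order_items ALTER COLUMN unit_price SET DEFAULT 0;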

Convert the application


The final (and, needless to say, most involved) step was to convert the actual application itself to work with PostgreSQL. Despite the fact that both databases speak SQL, we had to come up with solutions to the following issues:

Quotation styles


MySQL is more lax with its quoting, so some of this migration involved hunting down quoting differences. The codebase contained lots of double-quoted string literals, which PostgreSQL interprets as identifiers, as well as differently quoted column names (backticks for MySQL, double quotes for PostgreSQL). These had to be identified wherever they appeared and fixed to use a consistent quoting style.

Specific unsupported syntax differences:


INSERT ON DUPLICATE KEY

MySQL supports the INSERT ... ON DUPLICATE KEY UPDATE syntax. Migrating these queries involved creating a special UPSERT-style function to support the different options in use in the code base. We isolated the uses of INSERT ON DUPLICATE KEY UPDATE and sorted them into two categories: those which did a straight record replace and those which did some other sort of modification. I wrote a utility script (detailed later in this article) to replicate the logic the application expected.
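
As a sketch of the transformation (the cart table and the upsert_cart function here are hypothetical; the function would be generated by the utility script described below), a MySQL query such as:

INSERT INTO cart (cart_id, sku, qty)
    VALUES (123, 'ABC-1', 1)
    ON DUPLICATE KEY UPDATE qty = VALUES(qty);

was rewritten to call the corresponding PostgreSQL function:

SELECT upsert_cart(123, 'ABC-1', 1);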

Upcoming versions of PostgreSQL are likely to incorporate an INSERT ... ON CONFLICT UPDATE/IGNORE syntax, which would provide a more direct way of migrating these sorts of queries.

INSERT IGNORE

MySQL's INSERT ... IGNORE syntax allows you to insert a row and effectively ignore a primary key violation, assuming the rest of the row is valid. You can handle this case by creating a similar UPSERT function as in the previous point. Again, this case would be easily resolved if PostgreSQL adopts the INSERT ... ON CONFLICT IGNORE syntax.

REPLACE INTO

MySQL's REPLACE INTO syntax effectively does a DELETE followed by an INSERT; it ensures that a specific version of a row exists for the given primary key value. We handled this case by modifying these queries to do an unconditional DELETE for the primary key in question followed by the corresponding INSERT, and we ensured both were done within a single transaction so the result is atomic.
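
For example (hypothetical table and values):

-- MySQL
REPLACE INTO user_prefs (user_id, pref_name, pref_value)
    VALUES (42, 'theme', 'dark');

-- PostgreSQL
BEGIN;
DELETE FROM user_prefs WHERE user_id = 42 AND pref_name = 'theme';
INSERT INTO user_prefs (user_id, pref_name, pref_value)
    VALUES (42, 'theme', 'dark');
COMMIT;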

INTERVAL syntax

Date interval syntax differs slightly: intervals may be unquoted in MySQL, but must be quoted in PostgreSQL. This project required hunting down several instances and adding quotes to the literal INTERVAL expressions.
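
A minimal example of the difference:

-- MySQL
SELECT now() - INTERVAL 7 DAY;

-- PostgreSQL
SELECT now() - INTERVAL '7 days';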

Function considerations


last_insert_id()

Often, after inserting a record into a MySQL table, the application looks up the new row using the last_insert_id() SQL function. These queries need to be modified to use the equivalent PostgreSQL sequence functionality, such as the currval() function.
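
A minimal sketch, assuming a hypothetical users table whose id column defaults to nextval('users_id_seq'):

INSERT INTO users (email) VALUES ('someone@example.com');

-- MySQL: SELECT last_insert_id();
-- PostgreSQL:
SELECT currval('users_id_seq');

-- or avoid naming the sequence entirely:
INSERT INTO users (email) VALUES ('someone@example.com') RETURNING id;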

GROUP_CONCAT()

MySQL has the GROUP_CONCAT() function, which serves as a (string) "join" of sorts. We emulated this behavior in PostgreSQL using the string_agg() aggregate function with the delimiter of choice.
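
For example (hypothetical table and columns):

-- MySQL
SELECT order_id, GROUP_CONCAT(sku SEPARATOR ', ') FROM order_items GROUP BY order_id;

-- PostgreSQL
SELECT order_id, string_agg(sku, ', ') FROM order_items GROUP BY order_id;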

CONCAT_WS()

PostgreSQL has included a CONCAT_WS() function since version 9.1, so this turned out not to be an issue in this specific migration, though it could still be one if you are migrating to an older version of PostgreSQL.

str_to_date()

This function does not exist in PostgreSQL, but can be simulated using to_date(). Note, however, that the format string argument differs between the MySQL and PostgreSQL versions.
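
For example, parsing the same date string in both systems:

-- MySQL
SELECT str_to_date('28/11/2014', '%d/%m/%Y');

-- PostgreSQL
SELECT to_date('28/11/2014', 'DD/MM/YYYY');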

date_format()

MySQL's date_format() function transforms a date into a string with a given format. PostgreSQL has similar functionality in the to_char() function; the main difference lies in the format string specifiers.
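
For example, formatting the current timestamp:

-- MySQL
SELECT date_format(now(), '%Y-%m-%d %H:%i');

-- PostgreSQL
SELECT to_char(now(), 'YYYY-MM-DD HH24:MI');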

DateDiff()

DateDiff() does not exist in PostgreSQL; these calls were transformed into the equivalent date arithmetic using the subtraction (-) operator.
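
For example, to get the number of days between two dates:

-- MySQL
SELECT datediff('2014-12-01', '2014-11-01');

-- PostgreSQL: subtracting two dates yields an integer number of days
SELECT date '2014-12-01' - date '2014-11-01';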

rand() to random()

This is more or less a simple function rename, as equivalent functionality for returning a random float in the range 0 <= x < 1 exists in both MySQL and PostgreSQL; only the function name differs. One caveat: MySQL's rand() accepts an optional argument, but it is a random seed rather than a scale, so if the application needs a random number between 0 and N, you scale the result yourself via random() * N.

IF() to CASE WHEN ELSE

MySQL has an IF() function which returns the second argument if the first argument evaluates to true, and otherwise returns the third argument. This can be trivially converted from IF(expression1, arg2, arg3) to the equivalent PostgreSQL syntax: CASE WHEN expression1 THEN arg2 ELSE arg3 END.

IFNULL() to COALESCE()

MySQL's IFNULL() function returns the first argument if it is not NULL, otherwise it returns the second argument. This can effectively be replaced by PostgreSQL's COALESCE() function, which serves the same purpose.

split_part()

MySQL's built-in SUBSTRING_INDEX() function returns part of a string based on a delimiter and a position; PostgreSQL's closest equivalent is split_part(). The difference is that MySQL allows the position to be negative, in which case the part is counted from the right-hand side, while PostgreSQL's split_part() does not.

in MySQL:
substring_index('a banana boat', ' ', -1) => 'boat'
in PostgreSQL:
split_part('a banana boat', ' ', -1) => ERROR:  field position must be greater than zero

I fixed this issue by creating a custom plpgsql function to handle this case. (In my specific case, all of the negative indexes were -1, i.e. the last element in the array, so I created a function to return only the substring occurring after the last instance of the delimiter.)
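
A minimal sketch of such a function (the name and exact behavior here are illustrative, not the production code):

CREATE OR REPLACE FUNCTION last_part(str text, delim text)
RETURNS text
LANGUAGE plpgsql
IMMUTABLE
AS $$
DECLARE
    parts text[] := string_to_array(str, delim);
BEGIN
    -- return everything after the last occurrence of the delimiter
    RETURN parts[array_length(parts, 1)];
END;
$$;

-- last_part('a banana boat', ' ') => 'boat'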

Performance considerations


You may need to revisit COUNT(*) queries

MySQL's MyISAM tables have a very fast COUNT(*), because the storage engine keeps an exact row count (its table-level locking means the count is always accurate). PostgreSQL uses MVCC, so counting a table requires a scan to determine which rows are visible to the calling snapshot; queries which assume COUNT(*) is cheap may need to be revisited and rewritten into an equivalently performant form.

GROUP BY vs DISTINCT ON

MySQL is much more (*ahem*) flexible when it comes to GROUP BY/aggregate queries, allowing columns to appear in the select list without being named in the GROUP BY or wrapped in an aggregate function. Making the equivalent query in PostgreSQL involves transforming it from SELECT ... GROUP BY (cols) to SELECT DISTINCT ON (cols) ... and providing an explicit sort order for the rows.
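
A sketch of the transformation (hypothetical table; the DISTINCT ON form makes explicit which row per group is wanted, e.g. the most recent):

-- MySQL's lax GROUP BY
SELECT customer_id, order_id, created_at
FROM orders
GROUP BY customer_id;

-- PostgreSQL
SELECT DISTINCT ON (customer_id) customer_id, order_id, created_at
FROM orders
ORDER BY customer_id, created_at DESC;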

More notes

Don't be afraid to script things; in fact, I would go so far as to suggest that everything you do should be scripted. This process was complicated, with lots of moving parts that had to move in tandem. Changes were being made to the site itself concurrently, so we tested against a dump of the original database taken at a specific point in time. Having everything scripted ensured that the process was repeatable and testable, and that we could get to a specific point in the process without having to remember anything done off-the-cuff.

In addition to scripting the actual SQL/migrations, I found it helpful to script the solutions to various classes of problems. I wrote some scripts to create some of the scaffolding/boilerplate for the tables involved. This included a script which, given a table name, would create an UPSERT function for that table; it was used when replacing the INSERT ON DUPLICATE KEY UPDATE statements. The generated function could then be tailored to handle more complex logic beyond a simple UPDATE. (One instance here was an INSERT ON DUPLICATE KEY UPDATE which incremented a counter field in the table instead of replacing the value.)

#!/usr/bin/env perl
# -*-cperl-*-

use strict;
use warnings;

use DBI;
use Data::Dumper;

my $table = shift or die "Usage: $0 <table>\n";
my @cols = @ARGV;

my $dbh = DBI->connect(...);

my @raw_cols = @{ $dbh->column_info(undef, 'public', $table, '%')->fetchall_arrayref({}) };
my %raw_cols = map { $_->{COLUMN_NAME} => $_ } @raw_cols;

die "Can't find table $table\n" unless @raw_cols;

my @missing_cols = grep { ! defined $raw_cols{$_} } @cols;

die "Referenced non-existing columns: @missing_cols\n" if @missing_cols;

my %is_pk;

unless (@cols) {
    @cols = map { $_->{COLUMN_NAME} } @raw_cols;
}

my @pk_cols = $dbh->primary_key(undef, 'public', $table);

@is_pk{@pk_cols} = (1)x@pk_cols;

my @data_cols = grep { ! $is_pk{$_} } @cols;

die "Missing PK cols from column list!\n" unless @pk_cols == grep { $is_pk{$_} } @cols;
die "No data columns!\n" unless @data_cols;

print <<EOF
CREATE OR REPLACE FUNCTION
    upsert_$table (@{[
    join ', ' => map {
        "_$_ $raw_cols{ $_ }->{pg_type}"
    } @cols
]})
RETURNS void
LANGUAGE plpgsql
AS \$EOSQL\$
BEGIN
    LOOP
        UPDATE $table SET @{[
    join ', ' => map { "$_ = _$_" } @data_cols
]} WHERE @{[
    join ' AND ' => map { "$_ = _$_" } @pk_cols
]};
        IF FOUND THEN
            RETURN;
        END IF;
        BEGIN
            INSERT INTO $table (@{[join ',' => @cols]}) VALUES (@{[join ',' => map { "_$_" } @cols]});
            RETURN;
        EXCEPTION WHEN unique_violation THEN
            -- Do nothing, and loop to try the UPDATE again.
        END;
    END LOOP;
END
\$EOSQL\$
;
EOF
;

This script creates an upsert function for a given table, updating all columns by default, while also allowing you to generate one which upserts only a subset of the columns.

I also wrote scripts which could handle/validate some of the column datatype changes. Since there were a large number of columns to change, often multiple in the same table, I had this script create a single ALTER TABLE statement with multiple ALTER COLUMN TYPE USING clauses, and let me specify how each column's conversion should take place. There were several different approaches, depending on the target data type, but generally they solved cases where fairly legitimate data was not accepted by PostgreSQL's input parsers. These included how to interpret blank fields as integers (in some cases we wanted 0, in others NULL), weird numeric formatting (leaving off digits before or after the decimal point), etc.
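
As an illustration of the sort of statement this script produced (the table and columns are hypothetical), with USING clauses handling blank strings differently per column:

ALTER TABLE products
    ALTER COLUMN stock_qty TYPE integer
        USING (CASE WHEN stock_qty = '' THEN 0 ELSE stock_qty::integer END),
    ALTER COLUMN weight TYPE numeric
        USING (NULLIF(trim(weight), '')::numeric);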

In several locations we also had to fix up missing defaults for AUTO_INCREMENT columns. The tables were created with the proper datatype, but we had to find the tables which matched a specific naming convention, create/associate a sequence for the column, set the proper default, etc. (This was detailed above.)

There was a fair amount of iteration and customization in this process, as a good deal of data was not in the expected format. The process generally involved attempting to alter the table within a transaction and finding the next datum for which the conversion to the expected type did not work, then modifying the USING clause of the ALTER TABLE ... ALTER COLUMN TYPE to accommodate that specific issue.

In several cases, there were only a couple of records with bad/bunko data, so I included explicit UPDATE statements to fix those values by primary key. While this felt a bit "impure", it was a quick and practical solution for a few specific records which did not fit the general rules.

Laufeyjarson writes... » Perl: PBP: 060 Slicing

Everyone’s favorite set of Perl guidelines tells us not to forget about the awesome power of array slices!  That’s smart, too.

The book then goes on to give shiny examples, and warn of a few pitfalls of using array slices.  They’re good and clear.

Part of the book’s age is showing in that it does not mention the key/value hash slice, added less than a year ago in Perl 5.20.  They’re another way to extract data from a hash, given a specific set of keys, like an array slice.  Neat, but I haven’t bumped into a use in the wild yet.

I do find the reason the book gave for using array slices to be true; readability and clarity.  If you’re extracting several array items in a row, that’s a slice, and it might be more clearly written that way.

On the other hand, a lot of Perl programmers don’t know slices exist, and will tell you that leading an array use with @ will not work.  This means that using them makes the code readable and maintainable by only very advanced engineers.  I find this at odds with other advice in the book, which caters to people stuck on hard terminals and awk programmers and things.  I prefer to lean this way; encourage deeper understanding of the language, and use the features available.  It’s a teachable moment for any engineer who isn’t familiar with slices.  Seize it and encourage them to grow!

This means, too, that we can let awk programmers keep up with the times, and encourage everyone to get bitmapped displays with real fonts.  I’m sure I’ll have more to rant about this when I get to Chapter 12.

Perl Foundation News: Grant Report: Inline::C(PP) - November 2014

Ingy and David have made great strides this month on Inline::Module. Weekly updates can be found at the Ouistreet Inline blog.

Highlights

According to David: "In brief, our primary objective now works for four out of the five toolchains we intend to support, and we will get the last one done probably this week. After that some polish and a few demonstration releases, and we will be ready to wrap it up."

MAJ

NEILB: todoist - a script interface to todoist.com

Last night I released the first version of App::todoist, which provides a todoist script. This is a simple command-line interface to todoist.com, an online todo list service. About 11pm I realised I had an hour to get something released to CPAN, or I'd break my weekly release chain.

Perl Foundation News: Inside look at TPF's budget process

It hasn't always been the case that The Perl Foundation has much of a budget to speak of. There were many early years that we flew by the seat of our pants. "Is it in the budget" was more or less the same as asking "what's left in the bank account?" But, as the foundation has grown up, so have our business and accounting practices.

First, it is helpful to know that The Perl Foundation has an accountant and bookkeeper that we work with in order to prepare our taxes, keep our records straight, and to help out with the interesting accounting questions that come along with running a charitable foundation. Everything that we do is tracked using commonly accepted standard accounting practices by an independent, licensed accountant.

So, what's up with this mysterious TPF budget anyway? How does it all work? As with everything else, it starts with the Perl community. The leaders of TPF do their best to listen to the community. We try to feel out what the needs are. We also solicit feedback to determine the attitude on existing programs: do the members of the community favor what we are doing? Have the programs been successful? What needs adjusting?

Next, the Treasurer reaches out to the Chairpeople of TPF's committees: Community Advocacy, Conferences, Marketing, Grants, and Steering. They need to write up their plans for the coming year, explaining what they want to do and how much it will cost.

Once all of that is done, the Treasurer meets with the President to get feedback on her vision for the upcoming year. The feedback that has been received and the various committee requests are discussed. There are also programs that operate outside of a specific committee (for example, the Outreach Program for Women). All of this must be considered in detail in order to formulate both a reasonable set of objectives for the upcoming year and a matching budget. The outcome is a draft plan for the following year.

Finally, the draft plan is sent along to The Perl Foundation's Board of Directors. The Board reviews the recommendations, makes adjustments as they deem necessary, and ultimately approves the final budget.

Once the budget has been approved, each chairperson has the discretion to spend their budget on the items they had initially requested (within a few limitations and guidelines.)

The budgets can get a little complicated due to the way we handle fund accounting: we have a general fund that handles the majority of our operations. We also operate separate funds for the Ian Hague grants and the Perl 5 Core grants. So, each of those needs to be budgeted separately.

We also operate separate custodial funds for the various Perl workshops that associate with us. But since TPF's money typically does not flow in or out of the Perl workshops, we do not budget those.

The entire process takes about three months to complete. Requests from Chairpeople were due today. We expect to have the 2015 budget final by early January.

Ricardo Signes: Email::MIME::Kit v3 will fix-and-or-break your code

Ever since its early releases, Email::MIME::Kit had a big problem. It screwed up encodings. Specifically, imagine this manifest (I'm kinda skipping some required junk):

# manifest.yaml
renderer: TemplateToolkit
headers:
  - Subject: "Message for [% name %]"
alternatives:
  - type: text/plain
    path: body.txt
  - type: text/html
    path: body.html

The manifest turns into a data structure before it's used, and the subject header is a text string that, later, will get encoded into MIME encoded-words on the assumption that it's all Unicode text.

The files on disk are read with :raw, then filled in as-is, and trusted to already be UTF-8.

If your customer's name is Распутин, strangely enough, you're okay. The header handling encodes it properly, and the wide characters (because Cyrillic codepoints are all above U+00FF) turn into UTF-8 with a warning. On the other hand, for some trouble, consider Ævar Arnfjörð Bjarmason. All those codepoints are below U+0100, so the non-ASCII ones are encoded directly, and you end up with =C6 (Æ) in your quoted-printable body instead of =C3=86 (Æ UTF-8 encoded).

Now, you're probably actually okay. Your email is not correct, but email clients are good at dealing with your (read: my) stupid mistakes. If your email part says it's UTF-8 but it's actually Latin-1, mail clients will usually do the right thing.

The big problem is when you've got both Ævar Arnfjörð Bjarmason and Распутин in your email. Your body is a mishmash of Latin-1 and UTF-8 data.

In Email::MIME::Kit v3, templates (or non-template bodies) loaded from disk are — if and only if they're for text/* parts — decoded into text and then, when the email is assembled, it's encoded by Email::MIME's usual header_str handling.

There's a case where this can start making things worse, rather than better. If you know that templates in files are treated as bytes, you might be passing in strings pre-encoded into UTF-8. If that was the case, it will now become mojibake.

Finally, plugins that read kit contents for uses as text will need upgrading. The only one I know of like this is my own Email::MIME::Kit::Assembler::Markdown. I will fix it. The trick is: look at what content-type is being built and consider using get_decoded_kit_entry instead of get_kit_entry.

I think this is an important change, and worth the breakage. Please look at your use of EMK and test with v3.

Laufeyjarson writes... » Perl: PBP: 059 Array Indices

The PBP reminds us that we can use negative indexes to access arrays in reverse, instead of having to do calculations for it.  It suggests several useful reasons, including readability and getting the interpreter to complain in some cases instead of silently doing the wrong thing.  I don’t feel that strongly about this suggestion, but do follow it.

Negative indexes access an array from the end.  Sometimes I don’t think this is very well known, and I have been known to use it as an interview question to try and plumb the depths of the candidate’s Perl knowledge.

Other than accessing the last index with -1, I believe I have used this twice in my entire career.  Once when processing a list backwards, and once when I needed the last several items.  Today I might have done the latter with an array slice instead.

I agree with both reasons presented in the book, but haven’t found them to be that critical in real world applications.

The only problem I have doing this with perlcritic on is the actual use of the negative offsets, as they are not accepted as numbers.  I wish -1, particularly, were, both for the last element in an array and to invert the sign of a number.  Having to define constants for those is goofy.  But that’s a different practice!

Ovid: Creating Missions in Veure

Creating missions in Veure is probably the most challenging issue in the game. Not only is it technically complex, but in terms of game design, it's very easy to get wrong. If you get missions wrong, it's easy for the game to be a dud. So today I'm going to talk about how World of Warcraft got their missions (er, quests), so very wrong, yet still became the most wildly successful MMORPG of all time.

See that quest on the right? That's a level 6 quest named "A Putrid Task". All you have to do is bring seven "Putrid Claws" to Deathguard Dillinger in Brill. There is nothing challenging about this. Go out, find the claws, pick 'em up and bring 'em back. However, this is where it gets interesting. Let's jump ahead to a typical level 42 quest in World of Warcraft.

This is a level 42 quest named "Carcass Collection". In this quest, you go out, find 10 carcasses and bring them back to Ajamon Ghostcaller in Thousand Needles. Here's the quest text for that:

These organs be a good start, mon, but we still be needin' lotsa other tings. Normally, we'd be gettin' stuff from family now, but dats only me and my sistah's boy, and Tony ain't worth it!

So, we be substitutin' some stuff. Lotsa carcasses, in fact!

Deep down in the Shimmering Deep, there's tons of dem from when the water came rushin' in. Killed all of dem creatures. Go down there and bring me up some.

Did you read that? I wouldn't be surprised if more people read that quest text from my blog than in the damned game. After a while in WoW, you stop reading the text. You click the yellow exclamation point above someone's head, see what numbers you need to get, decide if you want the reward, and off you go. Generally players click straight through, collecting several quests at the same time and rush off to an area and grind through all of the quests at the same time.

As you're grinding your way to max level in WoW, this is wash/rinse/repeat for weeks on end. Aside from pretty pictures, there's nothing particularly different about this. You might claim that WoW quests are linear and simplistic. You might also claim that water is wet. To say that Blizzard has received some criticism over this is a massive understatement. Many people have complained bitterly that Blizzard has dumbed down quests to the point where anybody can complete them.

I'll let that sink in for a minute. Meanwhile, contemplate this from the WoW Wikipedia entry:

With 7.4 million subscribers as of October 2014, World of Warcraft is currently the world's most-subscribed MMORPG, and holds the Guinness World Record for the most popular MMORPG by subscribers. Having grossed over 10 billion dollars USD as of July 2012, it is also the highest grossing video game of all time, surpassing Call of Duty: Black Ops at 1.5 billion dollars. In January 2014 it was announced that more than 100 million accounts had been created over the game's lifetime.

I don't pretend Veure will ever hold a candle to that, but I'm also not going to disregard that. Much of my work with Veure has been in heavy research on how these games are created, in addition to their business models. There is no point in me trying to create a new company if I don't understand the business. This is a high-risk endeavor for me as my small amount of free time is spent building this.

So why did WoW become so successful? Because Blizzard worked hard to make the game accessible. Sure, they could have done a lot more to add complexity to various aspects of their MMORPG, but they were also faced with the enormous challenge of creating a graphic MMORPG. That's probably a couple of orders of magnitude harder than what I'm doing. They made some great trade-offs and there's still plenty of room for creativity and thoughtful play; it's just not in the quest system.

So, Veure quests. Er, missions. What about 'em? Well, here's a beginning quest named Deliver "French Bread": pick up a package and deliver it to someone on another station.

Whoa! Wait a minute? Didn't I just complain about linear, boring, repetitive quests? Well, yes. Yes I did. However, bear with me for a moment.

Veure is envisioned as a free-to-play MMORPG. There will be no limitations on what you can do in free play and, just as important, no pay to win crap. In "pay to win" games, you can play for free, but if you don't pay, you often can't advance. Some aspects are just so difficult that without paying that dollar for bonuses, you get stuck. And a dollar isn't so bad, is it?

Well, yes it is. You pay dollar after dollar after dollar. Successful players have fat wallets.

That sucks. I want everyone to be able to play and enjoy the game, but I also have to build a successful business model. My approach is a compromise: build a fun game, but one in which players can "donate" a small amount of money which lets them regenerate faster. This means they can advance faster. If you don't pay, you can still play, but you advance at the normal rate. However, there is still no limit to your advancement. There are no artificial constraints.

In order to make Veure into something that players want to play, there are many things I have to do. One of them is to make the mission system more interesting. So many quests will still be linear ... on the surface. Take the Deliver "French Bread" mission. On Tau Station, in the Gaul Embassy, Ambassador Talleyrand asks you to find Montazano because he might have work for you.

You search and eventually you find Montazano skulking about the ruins of Tau Station (like all space stations in the game, The Catastrophe destroyed much of the station and today, three centuries later, they're still being rebuilt). Montazano asks you to deliver a package of "French Bread" to Hervé on the Nouveau Limoges station. So you take the package, catch a shuttle to Nouveau Limoges and gain 10 experience points just for visiting a new station. Except that you're deported back to Tau Station because you forgot to buy a visa to visit Gaul stations.

So you head to the Gaul embassy, buy a visa, and go back to Nouveau Limoges, find Hervé, deliver the "French Bread", and the mission is completed. A very typical, linear, FedEx quest. Boring, but it's also an "exit quest" in that it forces you to visit a new area.

So why is "French Bread" in scare quotes? Well, if you look at the "French Bread" item you received, it's actually a tightly wrapped box with no visible means of opening it. Whatever it is, it's obviously not French bread. So let's look at the mission again.

What are those red, dotted lines? Those are secret, alternate endings to this quest. What do they do? If you enter the security station on either Tau Station or Nouveau Limoges, you're presented with the opportunity to turn over the "French Bread." Given that Tau Station is a Consortium station and Nouveau Limoges is a Gaul station and you were given the quest by a Gaul ambassador, it's reasonable to expect that a different result might occur depending upon who you give the package to.

And what happens if you attack either Montazano or Hervé?

If you look at the new quest diagram, it's clear that it's actually a directed acyclic graph (in fact, it's a state machine with pre- and post-conditions for each node). The graph probably needs to be directed to reduce complexity, but why does it need to be acyclic? Could you work through a complex mission and find yourself back at the starting point? Imagine a mission looking like this:

Well, that changes things a bit, doesn't it?

For free-to-play games like I'm describing, one strong component to making a business model is to have these be slow games. You can't grind to max level in a couple of weeks. That would kill the game and why would anyone pay? It's the same well-known problem with having story in games: stories have a beginning and an ending. The ending means the end of income. Oops.

Slow games like Veure are more like pleasant hobbies. You can still interact with others via chat, corporations, or some other ideas I've been toying with, but to progress, that's slow. It needs to be. You pop over to your browser every once in a while, do a few things and wait. In fact, I am envisioning upper-level missions which could take weeks to complete. Missions with alternate pathways, missions you can't get without sufficient reputation and which might cut off other avenues (do too much work for Freebooters and the Consortium might get very upset with you).

Much of the above for missions was working about a week ago, but that was an exploratory spike. Once I understood the mechanics better, I tossed it because the interface was awful and too obvious. Instead, if you click on an NPC's name and there's a "Talk" link which wasn't there before, what does it do? Did you even notice it?

Building an effective mission system is hard. I want it to have the popular linear characteristics which are so attractive to many players, but also have exploratory game play that serious players want to explore. It's going to take a long time to build this, and even longer to create the missions for it. Currently, WoW has over 9,000 quests in the game. Even with my procedural quest generation work, there's a huge amount to research.

A note about space stations: stations each have a theme. What sort of quests would you find in an oceanic station, where everything's underwater and the inhabitants have genetically modified themselves to have gills and withstand extreme pressure? You can leave most of your kinetic and energy weapons behind because they won't do you any good.

Or what about a station that's loosely based on the films of Jeunet and Caro? Yeah, some of my ideas are kind of out there. I'm not sure if I'll go there, but I think it would be a very intriguing possibility.

Building the infrastructure of a galaxy is hard enough. Now I need to populate it.

So, what do you think so far? Tell me!

PAL-Blog: Kürbissuppe (Pumpkin Soup)

The veggie experiment enters its second round: can a family eat partly vegetarian, or even vegan, without noticing it? This week the answer is: sure, with pumpkin soup and a touch of hearing loss.

Dave's Free Press: Journal: Module pre-requisites analyser

Dave's Free Press: Journal: CPANdeps

Dave's Free Press: Journal: Perl isn't dieing

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 3

Perlgeek.de : The Fun of Running a Public Web Service, and Session Storage

One of my websites, Sudokugarden, recently surged in traffic, from about 30k visitors per month to more than 100k visitors per month. Here's the tale of what that meant for the server side.

As a bit of background, I built the website in 2007, when I knew a lot less about the web and programming. It runs on a host that I share with a few friends; I don't have root access on that machine, though when the admin is available, I can generally ask him to install stuff for me.

Most parts of the websites are built as static HTML files, with Server Side Includes. Parts of those SSIs are Perl CGI scripts. The most popular part though, which allows you to solve Sudoku in the browser and keeps hiscores, is written as a collection of Perl scripts, backed by a mysql database.

When at peak times the site had more than 10k visitors a day, lots of visitors would get a nasty mysql: Cannot connect: Too many open connections error. The admin wasn't available for bumping the connection limit, so I looked for other solutions.

My first action was to check the logs for spammers and crawlers that might have hammered the page, and I found and banned some; but the bulk of the traffic looked completely legitimate, and the problem persisted.

Looking at the seven-year-old code, I realized that most pages didn't actually need a database connection, if only I could remove the session storage from the database. And, in fact, I could. I used CGI::Session, which has pluggable backends. Switching to a file-based session backend was just a matter of changing the connection string and adding a directory for session storage. Luckily the code was clean enough that this only affected a single subroutine. Everything was fine.

For a while.

Then, about a month later, the host ran out of free disk space. Since it is used for other stuff too (like email, and web hosting for other users) it took me a while to make the connection to the file-based session storage. What had happened was 3 million session files on an ext3 file system with a block size of 4 kilobytes. A session is only about 400 bytes, but since a file uses up a multiple of the block size, the session storage amounted to 12 gigabytes of used-up disk space, which was all that was left on that machine.

Deleting those sessions turned out to be a problem; I could only log in as my own user, which doesn't have write access to the session files (which are owned by www-data, the Apache user). The solution was to upload a CGI script that deleted the session, but of course that wasn't possible at first, because the disk was full. In the end I had to delete several gigabyte of data from my home directory before I could upload anything again. (Processes running as root were still writing to reserved-to-root portions of the file system, which is why I had to delete so much data before I was able to write again).

Even when I was able to upload the deletion script, it took quite some time to actually delete the session files; mostly because the directory was too large, and deleting files on ext3 is slow. When the files were gone, the empty session directory still used up 200MB of disk space, because the directory index doesn't shrink on file deletion.

Clearly a better solution to session storage was needed. But first I investigated where all those sessions came from, and banned a few spamming IPs. I also changed the code to only create sessions when somebody logs in, not give every visitor a session from the start.

My next attempt was to write the sessions to an SQLite database. It uses about 400 bytes per session (plus a fixed overhead for the db file itself), so it uses only a tenth of the storage space that the file-based storage used. The SQLite database has no connection limit, though the old-ish version installed on the server doesn't seem to have very fine-grained locking either; within a few days I got errors that the session database was locked.

So I added another layer of workaround: creating a separate session database per leading IP octet. So now there are up to 255 separate session databases (plus a 256th for all IPv6 addresses; a decision that will have to be revised when IPv6 usage rises). After a few days of operation, it seems that this setup works well enough. But suspicious as I am, I'll continue monitoring both disk usage and errors from Apache.

So, what happens if this solution fails to work out? I can see basically two approaches: move the site to a server that's fully under my control, and use redis or memcached for session storage; or implement sessions with signed cookies that are stored purely on the client side.

Dave's Free Press: Journal: Devel::CheckLib can now check libraries' contents

Perlgeek.de : Rakudo's Abstract Syntax Tree

After or while a compiler parses a program, the compiler usually translates the source code into a tree format called Abstract Syntax Tree, or AST for short.

The optimizer works on this program representation, and then the code generation stage turns it into a format that the platform underneath it can understand. Actually I wanted to write about the optimizer, but noticed that understanding the AST is crucial to understanding the optimizer, so let's talk about the AST first.

The Rakudo Perl 6 Compiler uses an AST format called QAST. QAST nodes derive from the common superclass QAST::Node, which sets up the basic structure of all QAST classes. Each QAST node has a list of child nodes, possibly a hash map for unstructured annotations, an attribute (confusingly) named node for storing the lower-level parse tree (which is used to extract line numbers and context), and a bit of extra infrastructure.

The most important node classes are the following:

QAST::Stmts
A list of statements. Each child of the node is considered a separate statement.
QAST::Op
A single operation that usually maps to a primitive operation of the underlying platform, like adding two integers, or calling a routine.
QAST::IVal, QAST::NVal, QAST::SVal
Those hold integer, float ("numeric") and string constants respectively.
QAST::WVal
Holds a reference to a more complex object (for example a class) which is serialized separately.
QAST::Block
A list of statements that introduces a separate lexical scope.
QAST::Var
A variable
QAST::Want
A node that can evaluate to different child nodes, depending on the context it is compiled in.

To give you a bit of a feel for how those node types interact, I want to give a few examples of Perl 6 code and the ASTs it could produce. (It turns out that Perl 6 is quite a complex language under the hood, and usually produces a more complicated AST than the obvious one; I'll ignore that for now, in order to introduce you to the basics.)

Ops and Constants

The expression 23 + 42 could, in the simplest case, produce this AST:

QAST::Op.new(
    :op('add'),
    QAST::IVal.new(:value(23)),
    QAST::IVal.new(:value(42)),
);

Here an QAST::Op encodes a primitive operation, an addition of two numbers. The :op argument specifies which operation to use. The child nodes are two constants, both of type QAST::IVal, which hold the operands of the low-level operation add.

Now the low-level add operation is not polymorphic; it always adds two floating-point values, and the result is a floating-point value again. Since the arguments are integers and not floating-point values, they are automatically converted to float first. That's not the desired semantics for Perl 6; actually the operator + is implemented as a subroutine named &infix:<+>, so the real generated code is closer to

QAST::Op.new(
    :op('call'),
    :name('&infix:<+>'),    # name of the subroutine to call
    QAST::IVal.new(:value(23)),
    QAST::IVal.new(:value(42)),
);

Variables and Blocks

Using a variable is as simple as writing QAST::Var.new(:name('name-of-the-variable')), but it must be declared first. This is done with QAST::Var.new(:name('name-of-the-variable'), :decl('var'), :scope('lexical')).

But there is a slight caveat: in Perl 6 a variable is always scoped to a block. So while you can't ordinarily mention a variable prior to its declaration, there are indirect ways to achieve that (lookup by name, and eval(), to name just two).

So in Rakudo there is a convention to create QAST::Block nodes with two QAST::Stmts children. The first holds all the declarations, and the second all the actual code. That way all the declarations always come before the rest of the code.

So my $x = 42; say $x compiles to roughly this:

QAST::Block.new(
    QAST::Stmts.new(
        QAST::Var.new(:name('$x'), :decl('var'), :scope('lexical')),
    ),
    QAST::Stmts.new(
        QAST::Op.new(
            :op('p6store'),
            QAST::Var.new(:name('$x')),
            QAST::IVal.new(:value(42)),
        ),
        QAST::Op.new(
            :op('call'),
            :name('&say'),
            QAST::Var.new(:name('$x')),
        ),
    ),
);

Polymorphism and QAST::Want

Perl 6 distinguishes between native types and reference types. Native types are closer to the machine, and their type name is always lower case in Perl 6.

Integer literals are polymorphic in that they can be either a native int or a "boxed" reference type Int.

To model this in the AST, QAST::Want nodes can contain multiple child nodes. The compile-time context decides which of those is actually used.

So the integer literal 42 actually produces not just a simple QAST::IVal node but rather this:

QAST::Want.new(
    QAST::WVal(Int.new(42)),
    'Ii',
    QAST::IVal(42),
)

(Note that Int.new(42) is just a nice notation to indicate a boxed integer object; it doesn't quite work like this in the code that translates Perl 6 source code into ASTs.)

The first child of a QAST::Want node is the one used by default, if no other alternative matches. Then comes a list where the elements at odd indexes are format specifications (here Ii for integers) and the elements at even indexes are the ASTs to use in those cases.

An interesting format specification is 'v' for void context, which is always chosen when the return value from the current expression isn't used at all. In Perl 6 this is used to eagerly evaluate lazy lists that are used in void context, and for several optimizations.

Dave's Free Press: Journal: I Love Github

Dave's Free Press: Journal: Palm Treo call db module

Ocean of Awareness: Removing obsolete versions of Marpa from CPAN

Marpa::XS, Marpa::PP, and Marpa::HTML are obsolete versions of Marpa, which I have been keeping on CPAN for the convenience of legacy users. All new users should look only at Marpa::R2.

I plan to delete the obsolete releases from CPAN soon. For legacy users who need copies, they will still be available on backPAN.

I do this because their placement on CPAN makes them "attractive nuisances" -- they show up in searches and generally make it harder to find Marpa::R2, which is the version that new users should be interested in. There is also some danger that a new user could, by mistake, use the obsolete versions instead of Marpa::R2.

It's been some time since someone has reported a bug in their code, so they should be stable for legacy applications. I would usually promise to fix serious bugs that affect legacy users, but unfortunately, especially in the case of Marpa::XS, it is a promise I would have trouble keeping. Marpa::XS depends on Glib, and uses a complex build which I last performed on a machine I no longer use for development.

For this reason, a re-release to CPAN with deprecatory language is also not an option. I probably would not want to do so anyway -- the CPAN infrastructure by default pushes legacy users into upgrading, which always carries some risk. New deprecatory language would add no value for the legacy users, and they are the only audience these releases exist to serve.

Comments

Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net. To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site.

Perlgeek.de : A new Perl 6 community server - call for funding

So far, many Perl 6 developers have used feather as a generic development server. Juerd, who has generously provided this server for us for free for many years, has announced that it will be shut down at the end of the year.

My daytime job is at a b2b IT outsourcing and hosting company called noris network, and they have agreed to sponsor the hosting/housing of a 1U 19" server in one of their state-of-the-art data centers in Nürnberg, Germany.

What's missing is the actual hardware. Some folks in the community have already agreed to participate in funding the hardware, though I have few concrete pledges.

So here is the call to action: if you want to help the Perl 6 community with a one-time donation towards a new community server, please send me an e-mail to moritz at faui2k3 dot org, specifying the amount you're willing to pledge and whether you want to stay private as a donor. I accept money transfer by PayPal and wire transfer (SWIFT). Direct hardware donations are also welcome. (Though transferring the actual money will be deferred until the final decision on what hardware to buy, and thus the total amount required, has been made.)

How much do we need?

Decent, used 1U servers seem to start at about 250€, though 350€ would get us a lot more bang (mostly RAM and hard disk space). And in general, the more the merrier. (Cheaper offers exist, for example on ebay, but usually they are without hard disks, so the need for extra drives makes them more expensive in total).

With more money, even beefier hardware and/or spare parts and/or a maintenance contract and/or new hardware would be an option.

What do we need it for?

The main tasks for the server are:

  • Hosting websites like perl6.org and the synopses
  • Hosting infrastructure like the panda metadata server
  • Be available for smoke runs of the compilers, star distributions and module ecosystem.
  • Be available as a general development machine for people who don't have linux available and/or not enough resources to build some Perl 6 compilers on their own machines comfortably.
  • A place for IRC sessions for community members
  • A backup location for community services like the IRC logs, the camelia IRC eval bot etc. Those resources are currently hosted elsewhere, though having another option for hosting would be very valuable.
  • A webspace for people who want to host Perl 6-related material.
  • It is explicitly not meant as a general hosting platform, nor as a mail server.

Configuration

If the hardware we get is beefy enough, I'd like to virtualize the server into two or three components: one for hosting the perl6.org and related websites, which should be rather stable, and one for the rest of the system. If resources allow it, and depending on the feedback I get, maybe a third virtual system for high-risk stuff like evalbot.

As the operating system I'll install Debian Jessie (the current testing), simply because I'll end up maintaining the system, and it's the system I'm most familiar with.

Perlgeek.de : doc.perl6.org and p6doc

Background

Earlier this year I tried to assess the readiness of the Perl 6 language, compilers, modules, documentation and so on. While I never got around to publish my findings, one thing was painfully obvious: there is a huge gap in the area of documentation.

There are quite a few resources, but none of them comprehensive (the most comprehensive are the synopses, but they are not meant for the end user), and no single location we can point people to.

Announcement

So, in the spirit of xkcd, I present yet another incomplete documentation project: doc.perl6.org and p6doc.

The idea is to take the same approach as perldoc for Perl 5: create user-level documentation in Pod format (here the Perl 6 Pod), and make it available both on a website and via a command line tool. The source (documentation, command line tool, HTML generator) lives at https://github.com/perl6/doc/. The website is doc.perl6.org.

Oh, and the last Rakudo Star release (2012.06) already shipped p6doc.

Status and Plans

Documentation, website and command line tool are all in very early stages of development.

In the future, I want both p6doc SOMETHING and http://doc.perl6.org/SOMETHING to either document or link to documentation of SOMETHING, be it a built-in variable, an operator, a type name, routine name, phaser, constant or... all the other possible constructs that occur in Perl 6. URLs and command line arguments specific to each type of construct will also be available (/type/SOMETHING URLs already work).

Finally I want some way to get a "full" view of a type, ie providing all methods from superclasses and roles too.

Help Wanted

All of that is going to be a lot of work, though the most work will be to write the documentation. You too can help! You can write new documentation, gather and incorporate already existing documentation with compatible licenses (for example synopsis, perl 6 advent calendar, examples from rosettacode), add more examples, proof-read the documentation or improve the HTML generation or the command line tool.

If you have any questions about contributing, feel free to ask in #perl6. Of course you can also just create pull requests right away :-).

Perlgeek.de : YAPC Europe 2013 Day 2

The second day of YAPC Europe was enjoyable and informative.

I learned about ZeroMQ, which is a bit like sockets on steroids. Interesting stuff. Sadly Design decisions on p2 didn't quite qualify as interesting.

Matt's PSGI archive is a project to rewrite Matt's infamous script archive in modern Perl. Very promising, and a bit entertaining too.

Lunch was very tasty, more so than the usual mass catering. Kudos to the organizers!

After lunch, jnthn talked about concurrency, parallelism and asynchrony in Perl 6. It was a great talk, backed by great work on the compiler and runtime. Jonathan's talks are always to be recommended.

I think I didn't screw up my own talk too badly, at least the timing worked fine. I just forgot to show the last slide. No real harm done.

I also enjoyed mst's State of the Velociraptor, which was a summary of what went on in the Perl world in the last year. (Much better than the YAPC::EU 2010 talk with the same title).

The Lightning talks were as enjoyable as those from the previous day. So all fine!

Next up is the river cruise, I hope to blog about that later on.

Perlgeek.de : Stop The Rewrites!

What follows is a rant. If you're not in the mood to read a rant right now, please stop and come back in an hour or two.

The Internet is full of people who know better than you how to manage your open source project, even if they only know some bits and pieces about it. News at 11.

But there is one particular instance of that advice that I hear often applied to Rakudo Perl 6: Stop the rewrites.

To be honest, I can fully understand the sentiment behind that advice. People see that it has taken us several years to get where we are now, and in their opinion, that's too long. And now we shouldn't waste our time with rewrites, but get the darn thing running already!

But software development simply doesn't work that way. Especially not if your target is moving, as is Perl 6. (Ok, Perl 6 isn't moving that much anymore, but there are still areas we don't understand very well, so our current understanding of Perl 6 is a moving target).

At some point or another, you realize that with your current design, you can only pile workaround on top of workaround, and hope that the whole thing never collapses.

[Image: a Jenga tower. Image courtesy of sermoa]

Those people who spread the good advice to never do any major rewrites again, they never address what you should do when you face such a situation. Build the tower of workarounds even higher, and pray to Cthulhu that you can build it robust enough to support a whole stack of third-party modules?

Curiously this piece of advice occasionally comes from people who otherwise know a thing or two about software development methodology.

I should also add that since the famous "nom" switchover, which admittedly caused lots of fallout, we had three major rewrites of subsystems (longest-token matching of alternations, bounded serialization and qbootstrap), all three of which caused no new test failures, and two of which caused no fallout from the module ecosystem at all. In return, we have much faster startup (a factor of 3 to 4) and a much more correct regex engine.

Perlgeek.de : The REPL trick

A recent discussion on IRC prompted me to share a small but neat trick with you.

If there are things you want to do quite often in the Rakudo REPL (the interactive "Read-Evaluate-Print Loop"), it makes sense to create a shortcut for them. And creating shortcuts for often-used stuff is what programming languages excel at, so you do it right in a Perl module:

use v6;
module REPLHelper;

sub p(Mu \x) is export {
    x.^mro.map: *.^name;
}

I have placed mine in $HOME/.perl6/repl.

And then you make sure it's loaded automatically:

$ alias p6repl="perl6 -I$HOME/.perl6/repl/ -MREPLHelper"
$ p6repl
> p Int
Int Cool Any Mu
>

Now you have a neat one-letter function which tells you the parents of an object or a type, in method resolution order. And a way to add more shortcuts when you need them.

Perlgeek.de : News in the Rakudo 2012.06 release

Rakudo development continues to progress nicely, and so there are a few changes in this month's release worth explaining.

Longest Token Matching, List Iteration

The largest chunk of development effort went into Longest-Token Matching for alternations in Regexes, about which Jonathan already blogged. Another significant piece was Patrick's refactor of list iteration. You probably won't notice much of that, except that for-loops are now a bit faster (maybe 10%), and laziness works more reliably in a couple of cases.

String to Number Conversion

String to number conversion is now stricter than before. Previously an expression like +"foo" would simply return 0. Now it fails, ie returns an unthrown exception. If you treat that unthrown exception like a normal value, it blows up with a helpful error message, saying that the conversion to a number has failed. If that's not what you want, you can still write +$str // 0.
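
Here is a minimal sketch of the new behaviour (my own example, not from the release announcement; the exact error text depends on the Rakudo version):

say +"foo" // 0;    # 0 -- the Failure is undefined, so // supplies a default
my $n = +"foo";     # an unthrown exception, stored like an ordinary value
say $n + 1;         # blows up here, when the value is actually used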

require With Argument Lists

require now supports argument lists, and that needs a bit more explaining. In Perl 6 routines are by default only looked up in lexical scopes, and lexical scopes are immutable at run time. So, when loading a module at run time, how do you make functions available to the code that loads the module? Well, you determine at compile time which symbols you want to import, and then do the actual importing at run time:

use v6;
require Test <&plan &ok &is>;
#            ^^^^^^^^^^^^^^^ evaluated at compile time,
#                            declares symbols &plan, &ok and &is
#       ^^^                  loaded at run time

Module Load Debugging

Rakudo had some trouble when modules were precompiled, but their dependencies were not. This happens more often than it sounds, because Rakudo checks the timestamps of the involved files, and loads the source version if it is newer than the compiled file. Since many file operations (including simple copying) change the time stamp, that could happen very easily.

To make debugging of such errors easier, you can set the RAKUDO_MODULE_DEBUG environment variable to 1 (or any positive number; currently there is only one debugging level, in the future higher numbers might lead to more output).

$ RAKUDO_MODULE_DEBUG=1 ./perl6 -Ilib t/spec/S11-modules/require.t
MODULE_DEBUG: loading blib/Perl6/BOOTSTRAP.pbc
MODULE_DEBUG: done loading blib/Perl6/BOOTSTRAP.pbc
MODULE_DEBUG: loading lib/Test.pir
MODULE_DEBUG: done loading lib/Test.pir
1..5
MODULE_DEBUG: loading t/spec/packages/Fancy/Utilities.pm
MODULE_DEBUG: done loading t/spec/packages/Fancy/Utilities.pm
ok 1 - can load Fancy::Utilities at run time
ok 2 - can call our-sub from required module
MODULE_DEBUG: loading t/spec/packages/A.pm
MODULE_DEBUG: loading t/spec/packages/B.pm
MODULE_DEBUG: loading t/spec/packages/B/Grammar.pm
MODULE_DEBUG: done loading t/spec/packages/B/Grammar.pm
MODULE_DEBUG: done loading t/spec/packages/B.pm
MODULE_DEBUG: done loading t/spec/packages/A.pm
ok 3 - can require with variable name
ok 4 - can call subroutines in a module by name
ok 5 - require with import list

Module Loading Traces in Compile-Time Errors

If module myA loads module myB, and myB dies during compilation, you now get a backtrace which indicates through which path the erroneous module was loaded:

$ ./perl6 -Ilib -e 'use myA'
===SORRY!===
Placeholder variable $^x may not be used here because the surrounding block
takes no signature
at lib/myB.pm:1
  from module myA (lib/myA.pm:3)
  from -e:1

Improved autovivification

Perl allows you to treat not-yet-existing array and hash elements as arrays or hashes, and automatically creates those elements for you. This is called autovivification.

my %h;
%h<x>.push: 1, 2, 3; # worked in the previous release too
push %h<y>, 4, 5, 6; # newly works in the 2012.06 release

Dave's Free Press: Journal: Graphing tool

Dave's Free Press: Journal: Travelling in time: the CP2000AN

Dave's Free Press: Journal: XML::Tiny released

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 1

Dave's Free Press: Journal: Thanks, Yahoo!

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 2

Perlgeek.de : Pattern Matching and Unpacking

When talking about pattern matching in the context of Perl 6, people usually think about regexes or grammars. Those are indeed very powerful tools for pattern matching, but not the only ones.

Another powerful tool for pattern matching and for unpacking data structures uses signatures.

Signatures are "just" argument lists:

sub repeat(Str $s, Int $count) {
    #     ^^^^^^^^^^^^^^^^^^^^  the signature
    # $s and $count are the parameters
    return $s x $count
}

Nearly all modern programming languages have signatures, so you might say: nothing special, move along. But there are two features that make them more useful than signatures in other languages.

The first is multi dispatch, which allows you to write several routines with the same name, but with different signatures. While they are extremely powerful and helpful, I don't want to dwell on them here. Look at Chapter 6 of the "Using Perl 6" book for more details.

The second feature is sub-signatures, which allow you to write a signature for a single parameter.

That sounds pretty boring at first, but it allows you, for example, to do declarative validation of data structures. Perl 6 has no built-in type for an array where each slot must be of a specific but different type. But you can still check for that in a sub-signature:

sub f(@array [Int, Str]) {
    say @array.join: ', ';
}
f [42, 'str'];      # 42, str
f [42, 23];         # Nominal type check failed for parameter '';
                    # expected Str but got Int instead in sub-signature
                    # of parameter @array

Here we have a parameter called @array, and it is followed by square brackets, which introduce a sub-signature for an array. When calling the function, the array is checked against the signature (Int, Str), and so if the array doesn't consist of exactly one Int and one Str, in that order, a type error is thrown.

The same mechanism can be used not only for validation, but also for unpacking, which means extracting some parts of the data structure. This simply works by using variables in the inner signature:

sub head(*@ [$head, *@]) {
    $head;
}
sub tail(*@ [$, *@tail]) {
    @tail;
}
say head <a b c >;      # a
say tail <a b c >;      # b c

Here the outer parameter is anonymous (the @), though it's entirely possible to use variables for both the inner and the outer parameter.

The anonymous parameter can even be omitted, and you can write sub tail( [$, *@tail] ) directly.

Sub-signatures are not limited to arrays. For working on arbitrary objects, you surround them with parentheses instead of brackets, and use named parameters inside:

multi key-type ($ (Numeric :$key, *%)) { "Number" }
multi key-type ($ (Str     :$key, *%)) { "String" }
for (42 => 'a', 'b' => 42) -> $pair {
    say key-type $pair;
}
# Output:
# Number
# String

This works because the => operator constructs a Pair, which has a key and a value attribute. The named parameter :$key in the sub-signature extracts the attribute key.

You can build quite impressive things with this feature, for example red-black tree balancing based on multi dispatch and signature unpacking. (More verbose explanation of the code.) Most use cases aren't this impressive, but still it is very useful to have occasionally. Like for this small evaluator.

Ocean of Awareness: Parsing: Top-down versus bottom-up

Comparisons between top-down and bottom-up parsing are often either too high-level or too low-level. Overly high-level treatments reduce the two approaches to buzzwords, and the comparison to a recitation of received wisdom. Overly low-level treatments get immersed in the minutiae of implementation, and the resulting comparison is as revealing as placing two abstractly related code listings side by side. In this post I hope to find the middle level; to shed light on why advocates of bottom-up and top-down parsing approaches take the positions they do; and to speculate about the way forward.

Top-down parsing

The basic idea of top-down parsing is as brutally simple as anything in programming: Starting at the top, we add pieces. We do this by looking at the next token and deciding then and there where it fits into the parse tree. Once we've looked at every token, we have our parse tree.

In its purest form, this idea is too simple for practical parsing, so top-down parsing is almost always combined with lookahead. Lookahead of one token helps a lot. Longer lookaheads are very sparsely used. They just aren't that helpful, and since the number of possible lookaheads grows exponentially, they get very expensive very fast.

Top-down parsing has an issue with left recursion. It's straightforward to see why. Take an open-ended expression like

    a + b + c + d + e + f + [....]

Here the plus signs continue off to the right, and adding any of them to the parse tree requires a dedicated node which must be above the node for the first plus sign. We cannot put that first plus sign into a top-down parse tree without having first dealt with all those plus signs that follow it. For a top-down strategy, this is a big, big problem.

Even in the simplest expression, there is no way of counting the plus signs without looking to the right, quite possibly a very long way to the right. When we are not dealing with simple expressions, this rightward-looking needs to get sophisticated. There are ways of dealing with this difficulty, but all of them share one thing in common -- they are trying to make top-down parsing into something that it is not.
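
To see what that reshaping looks like in practice, here is a sketch of my own (not from the original essay) in Perl 6, whose grammars are a form of recursive descent: the left-recursive rule expr ::= expr '+' term | term cannot be written directly, so it has to be restated as repetition.

grammar Sums {
    # The left-recursive form would recurse forever before consuming any input,
    # so a top-down grammar expresses "a '+'-separated list of terms" instead.
    token TOP  { \s* <expr> \s* }
    token expr { <term>+ % [ \s* '+' \s* ] }
    token term { \d+ }
}
say so Sums.parse('1 + 2 + 3');    # True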

Advantages of top-down parsing

Top-down parsing does not look at the right context in any systematic way, and in the 1970's it was hard to believe that top-down was as good as we could do. (It's not all that easy to believe today.) But its extreme simplicity is also top-down parsing's great strength. Because a top-down parser is extremely simple, it is very easy to figure out what it is doing. And easy to figure out means easy to customize.

Take another of the many constructs incomprehensible to a top-down parser:

    2 * 3 * 4 + 5 * 6
    

How do top-down parsers typically handle this? Simple: as soon as they realize they are faced with an expression, they give up on top-down parsing and switch to a special-purpose algorithm.

These two properties -- easy to understand and easy to customize -- have catapulted top-down parsing to the top of the heap. Behind their different presentations, combinator parsing, PEG, and recursive descent are all top-down parsers.

Bottom-up parsing

Few theoreticians of the 1970's imagined that top-down parsing might be the end of the parsing story. Looking to the right in ad hoc ways clearly does help. It would be almost paradoxical if there were no systematic way to exploit the right context.

In 1965, Don Knuth found an algorithm to exploit right context. Knuth's LR algorithm was, like top-down parsing as I have described it, deterministic. Determinism was thought to be essential -- allowing more than one choice easily leads to a combinatorial explosion in the number of possibilities that have to be considered at once. When parsers are restricted to dealing with a single choice, it is much easier to guarantee that they will run in linear time.

Knuth's algorithm did not try to hang each token from a branch of a top-down parse tree as soon as it was encountered. Instead, Knuth suggested delaying that decision. Knuth's algorithm collected "subparses".

When I say "subparses" in this discussion, I mean pieces of the parse that contain all the decisions necessary to construct the part of the parse tree that is below them. But subparses do not contain any decisions about what is above them in the parse tree. Put another way, subparses know who they are, but not where they belong.

Subparses may not know where they belong, but knowing who they are is enough for them to be assembled into larger subparses. And, if we keep assembling the subparses, eventually we will have a "subparse" that is the full parse tree. And at that point we will know both who everyone is and where everyone belongs.

Knuth's algorithm stored subparses by shifting them onto a stack. The operation to do this was called a "shift". (Single tokens of the input are treated as subparses with a single node.) When there was enough context to build a larger subparse, the algorithm popped one or more subparses off the stack, assembled a larger subparse, and put the resulting subparse back on the stack. This operation was called a "reduce", based on the idea that its repeated application eventually "reduces" the parse tree to its root node.

In handling the stack, we will often be faced with choices. One kind of choice is between using what we already have on top of the stack to assemble a larger subparse; or pushing more subparses on top of the stack instead ("shift/reduce"). When we decide to reduce, we may encounter the other kind of choice -- we have to decide which rule to use ("reduce/reduce").

Like top-down parsing, bottom-up parsing is usually combined with lookahead. For the same lookahead, a bottom-up parser parses everything that a top-down parser can handle, and more.

Formally, Knuth's approach is now called shift/reduce parsing. I want to demonstrate why theoreticians, and for a long time almost everybody else as well, were so taken with this method. I'll describe how it works on some examples, including two very important ones that stump top-down parsers: arithmetic expressions and left recursion. My purpose here is to bring to light the basic concepts, and not to guide an implementor. There are excellent implementation-oriented presentations in many other places. The Wikipedia article, for example, is excellent.

Bottom-up parsing solved the problem of left recursion. In the example from above,

    a + b + c + d + e + f + [....]

we simply build one subparse after another, as rapidly as we can. In the terminology of shift/reduce, whenever we can reduce, we do. Eventually we will have run out of tokens, and will have reduced until there is only one element on the stack. That one remaining element is the subparse that is also, in fact, our full parse tree.

The top-down parser had a problem with left recursion precisely because it needed to build top-down. To build top-down, it needed to know about all the plus signs to come, because these needed to be fitted into the parse tree above the current plus sign. But when building bottom-up, we don't need to know anything about the plus signs that will be above the current one in the parse tree. We can afford to wait until we encounter them.

But if working bottom-up solves the left recursion problem, doesn't it create a right recursion problem? In fact, for a bottom-up parser, right recursion is harder, but not much. That's because of the stack. For a right recursion like this:

    a = b = c = d = e = f = [....]

we use a strategy opposite to the one we used for the left recursion. For left recursion, we reduced whenever we could. For right recursion, when we have a choice, we always shift. This means we will immediately shift the entire input onto the stack. Once the entire input is on the stack, we have no choice but to start reducing. Eventually we will reduce the stack to a single element. At that point, we are done. Essentially, what we are doing is exactly what we did for left recursion, except that we use the stack to reverse the order.

Arithmetic expressions like

    2 * 3 * 4 + 5 * 6

require a mixed strategy. Whenever we have a shift/reduce choice, and one of the operators is on the stack, we check to see if the topmost operator is a multiply or an addition operator. If it is a multiply operator, we reduce. In all other cases, if there is a shift/reduce choice, we shift.
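
Here is a small sketch of my own, in Perl 6 (not from the original post), of that decision rule: an explicit stack, a shift step, and a reduce step, reducing whenever the operator on the stack binds at least as tightly as the next operator waiting in the input (or whenever the input is exhausted).

my %prec = '+' => 1, '*' => 2;
sub is-op($t) { %prec{$t}:exists }

sub shift-reduce(@tokens) {
    my @input = @tokens;
    my @stack;
    while @input || @stack.elems > 1 {
        # We may reduce when the stack ends in  value op value  and that operator
        # binds at least as tightly as the next operator in the input.
        my $reducible = @stack.elems >= 3
            && is-op(@stack[*-2]) && !is-op(@stack[*-1])
            && (!@input || (is-op(@input[0]) && %prec{@stack[*-2]} >= %prec{@input[0]}));
        if $reducible {
            my $right = @stack.pop;
            my $op    = @stack.pop;
            my $left  = @stack.pop;
            @stack.push: ($op eq '*' ?? $left * $right !! $left + $right);
        }
        else {
            @stack.push: @input.shift;   # shift the next token onto the stack
        }
    }
    return @stack[0];
}

say shift-reduce(<2 * 3 * 4 + 5 * 6>);   # 54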

In the discussion above, I have pulled the strategy for making stack decisions (shift/reduce and reduce/reduce) out of thin air. Clearly, if bottom-up parsing was going to be a practical parsing algorithm, the stack decisions would have to be made algorithmically. In fact, discovering a practical way to do this was a far from trivial task. The solution in Knuth's paper was considered (and apparently intended) to be mathematically provocative, rather than practical. But by 1979, it was thought a practical way to make stack decisions had been found and yacc, a parser generator based on bottom-up parsing, was released. (Readers today may be more familiar with yacc's successor, bison.)

The fate of bottom-up parsing

With yacc, it looked as if the limitations of top-down parsing were past us. We now had a parsing algorithm that could readily and directly parse left and right recursions, as well as arithmetic expressions. Theoreticians thought they'd found the Holy Grail.

But not all of the medieval romances had happy endings. And as I've described elsewhere, this story ended badly. Bottom-up parsing was driven by tables which made the algorithm fast for correct inputs, but unable to accurately diagnose faulty ones. The subset of grammars parsed was still not quite large enough, even for conservative language designers. And bottom-up parsing was very unfriendly to custom hacks, which made every shortcoming loom large. It is much harder to work around a problem in a bottom-up parser than it is to deal with a similar shortcoming in a top-down parser. After decades of experience with bottom-up parsing, top-down parsing has re-emerged as the algorithm of choice.

Non-determinism

For many, the return to top-down parsing answers the question that we posed earlier: "Is there any systematic way to exploit right context when parsing?" So far, the answer seems to be a rather startling "No". Can this really be the end of the story?

It would be very strange if the best basic parsing algorithm we know is top-down. Above, I described at some length some very important grammars that can be parsed bottom-up but not top-down, at least not directly. Progress like this seems like a lot to walk away from, and especially to walk back all the way to what is essentially a brute force algorithm. This perhaps explains why lectures and textbooks persist in teaching bottom-up parsing to students who are very unlikely to use it. Because the verdict from practitioners seems to be in, and likely to hold up on appeal.

Fans of deterministic top-down parsing and proponents of deterministic bottom-up parsing share an assumption: For a practical algorithm to be linear, it has to be deterministic. But is this actually the case?

It's not, in fact. To keep bottom-up parsing deterministic, we restricted ourselves to a stack. But what if we track all possible subpieces of parses? For efficiency, we can link them and put them into tables, making the final decisions in a second pass, once the tables are complete. (The second pass replaces the stack-driven see-sawing back and forth of the deterministic bottom-up algorithm, so it's not an inefficiency.) Jay Earley in 1968 came up with an algorithm to do this, and in 1991 Joop Leo added a memoization to Earley's algorithm which made it linear for all deterministic grammars.

The "deterministic grammars" are exactly the bottom-up parseable grammars with lookahead -- the set of grammars parsed by Knuth's algorithm. So that means the Earley/Leo algorithm parses, in linear time, everything that a deterministic bottom-up parser can parse, and therefore every grammar that a deterministic top-down parser can parse. (In fact, the Earley/Leo algorithm is linear for a lot of ambiguous grammars as well.)

Top-down parsing had the advantage that it was easy to know where you are. The Earley/Leo algorithm has an equivalent advantage -- its tables know where it is, and it is easy to query them programmatically. In 2010, this blogger modified the Earley/Leo algorithm to have the other big advantage of top-down parsing: The Marpa algorithm rearranges the Earley/Leo parse engine so that we can stop it, perform our own logic, and restart where we left off. A quite useable parser based on the Marpa algorithm is available as open source.

Comments

Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net. To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site.

Perlgeek.de : YAPC Europe 2013 Day 3

The second day of YAPC Europe climaxed in the river boat cruise, Kiev's version of the traditional conference dinner. It was a largish boat traveling on the Dnipro river, with food, drinks and lots of Perl folks. Not having fixed tables, and having to get up to fetch food and drinks, led to a lot of circulation, and thus to meeting many more people than at traditional dinners. I loved it.

Day 3 started with a video message from next year's YAPC Europe organizers, advertising for the upcoming conference and talking a bit about the opportunities that Sofia offers. Tempting :-).

Monitoring with Perl and Unix::Statgrab was more about the metrics that are available for monitoring, and less about doing stuff with Perl. I was a bit disappointed.

The "Future Perl Versioning" Discussion was a very civilized discussion, with solid arguments. Whether anybody changed their minds remain to be seen.

Carl Mäsak gave two great talks: one on reactive programming, and one on regular expressions. I learned quite a bit in the first one, and simply enjoyed the second one.

After the lunch (tasty again), I attended Jonathan Worthington's third talk, MoarVM: a metamodel-focused runtime for NQP and Rakudo. Again this was a great talk, based on great work done by Jonathan and others during the last 12 months or so. MoarVM is a virtual machine designed for Perl 6's needs, as we understand them now (as opposed to parrot, which was designed towards Perl 6 as it was understood around 2003 or so, which is considerably different).

How to speak manager was both amusing and offered a nice perspective on interactions between managers and programmers. Some of this advice assumed a non-tech-savvy manager, and thus didn't quite apply to my current work situation, but it was still interesting.

I must confess I don't remember too much of the rest of the talks that evening. I blame five days of traveling, hackathon and conference taking their toll on me.

The third session of lightning talks was again an interesting mix, containing interesting technical tidbits, the usual "we are hiring" slogans, some touching and thoughtful moments, and finally a song by Piers Cawley. He had written the lyrics in the previous 18 hours (including sleep), to (afaict) a traditional Irish song. Standing up in front of ~300 people and singing a song that you haven't really had time to practise takes a huge amount of courage, and I admire Piers both for his courage and his great performance. I hope it was recorded, and makes its way to the public soon.

Finally the organizers spoke some closing words, and received their well-deserved share of applause.

As you might have guessed from this and the previous blog posts, I enjoyed this year's YAPC Europe very much, and found it well worth attending, and well organized. I'd like to give my heartfelt thanks to everybody who helped to make it happen, and to my employer for sending me there.

This being only my second YAPC, I can't make any far-reaching comparisons, but compared to YAPC::EU 2010 in Pisa I had an easier time making acquaintances. I cannot tell what the big difference was, but the buffet-style dinners at the pre-conference meeting and the river boat cruise certainly helped to increase the circulation and thus the number of people I talked to.

Dave's Free Press: Journal: YAPC::Europe 2007 travel plans

Perlgeek.de : A small regex optimization for NQP and Rakudo

Recently I read the course material of the Rakudo and NQP Internals Workshop, and had an idea for a small optimization for the regex engine. Yesterday night I implemented it, and I'd like to walk you through the process.

As a bit of background, the regex engine that Rakudo uses is actually implemented in NQP, and used by NQP too. The code I am about to discuss all lives in the NQP repository, but Rakudo profits from it too.

In addition one should note that the regex engine is mostly used for parsing with grammars, a process which involves nearly no scanning. Scanning is the process where the regex engine first tries to match the regex at the start of the string, and if it fails there, moves to the second character in the string, tries again, and so on until it succeeds.

But regexes that users write often involve scanning, and so my idea was to speed up regexes that scan, and where the first thing in the regex is a literal. In this case it makes sense to find possible start positions with a fast string search algorithm, for example the Boyer-Moore algorithm. The virtual machine backends for NQP already implement that as the index opcode, which can be invoked as start = index haystack, needle, startpos, where the string haystack is searched for the substring needle, starting from position startpos.

From reading the course material I knew I had to search for a regex type called scan, so that's what I did:

$ git grep --word scan
3rdparty/libtommath/bn_error.c:   /* scan the lookup table for the given message
3rdparty/libtommath/bn_mp_cnt_lsb.c:   /* scan lower digits until non-zero */
3rdparty/libtommath/bn_mp_cnt_lsb.c:   /* now scan this digit until a 1 is found
3rdparty/libtommath/bn_mp_prime_next_prime.c:                   /* scan upwards 
3rdparty/libtommath/changes.txt:       -- Started the Depends framework, wrote d
src/QRegex/P5Regex/Actions.nqp:                     QAST::Regex.new( :rxtype<sca
src/QRegex/P6Regex/Actions.nqp:                     QAST::Regex.new( :rxtype<sca
src/vm/jvm/QAST/Compiler.nqp:    method scan($node) {
src/vm/moar/QAST/QASTRegexCompilerMAST.nqp:    method scan($node) {
Binary file src/vm/moar/stage0/NQPP6QRegexMoar.moarvm matches
Binary file src/vm/moar/stage0/QASTMoar.moarvm matches
src/vm/parrot/QAST/Compiler.nqp:    method scan($node) {
src/vm/parrot/stage0/P6QRegex-s0.pir:    $P5025 = $P5024."new"("scan" :named("rx
src/vm/parrot/stage0/QAST-s0.pir:.sub "scan" :subid("cuid_135_1381944260.6802") 
src/vm/parrot/stage0/QAST-s0.pir:    push $P5004, "scan"

The binary files and .pir files are generated code included just for bootstrapping, and not interesting for us. The files in 3rdparty/libtommath are there for bigint handling, thus not interesting for us either. The rest are good matches: src/QRegex/P6Regex/Actions.nqp is responsible for compiling Perl 6 regexes to an abstract syntax tree (AST), and src/vm/parrot/QAST/Compiler.nqp compiles that AST down to PIR, the assembly language that the Parrot Virtual Machine understands.

So, looking at src/QRegex/P6Regex/Actions.nqp the place that mentions scan looked like this:

    $block<orig_qast> := $qast;
    $qast := QAST::Regex.new( :rxtype<concat>,
                 QAST::Regex.new( :rxtype<scan> ),
                 $qast,
                 ($anon
                      ?? QAST::Regex.new( :rxtype<pass> )
                      !! (nqp::substr(%*RX<name>, 0, 12) ne '!!LATENAME!!'
                            ?? QAST::Regex.new( :rxtype<pass>, :name(%*RX<name>) )
                            !! QAST::Regex.new( :rxtype<pass>,
                                   QAST::Var.new(
                                       :name(nqp::substr(%*RX<name>, 12)),
                                       :scope('lexical')
                                   ) 
                               )
                          )));

So to make the regex scan, the AST (in $qast) is wrapped in QAST::Regex.new(:rxtype<concat>,QAST::Regex.new( :rxtype<scan> ), $qast, ...), plus some stuff I don't care about.

To make the optimization work, the scan node needs to know what to scan for, if the first thing in the regex is indeed a constant string, aka literal. If it is, $qast is either directly of rxtype literal, or a concat node where the first child is a literal. As a patch, it looks like this:

--- a/src/QRegex/P6Regex/Actions.nqp
+++ b/src/QRegex/P6Regex/Actions.nqp
@@ -667,9 +667,21 @@ class QRegex::P6Regex::Actions is HLL::Actions {
     self.store_regex_nfa($code_obj, $block, QRegex::NFA.new.addnode($qast))
     self.alt_nfas($code_obj, $block, $qast);
 
+    my $scan := QAST::Regex.new( :rxtype<scan> );
+    {
+        my $q := $qast;
+        if $q.rxtype eq 'concat' && $q[0] {
+            $q := $q[0]
+        }
+        if $q.rxtype eq 'literal' {
+            nqp::push($scan, $q[0]);
+            $scan.subtype($q.subtype);
+        }
+    }
+
     $block<orig_qast> := $qast;
     $qast := QAST::Regex.new( :rxtype<concat>,
-                 QAST::Regex.new( :rxtype<scan> ),
+                 $scan,
                  $qast,

Since scan nodes have always been empty so far, the code generators don't look at their child nodes, and adding one with nqp::push($scan, $q[0]); won't break anything on backends that don't support this optimization yet (which after just this patch were all of them). Running make test confirmed that.

My original patch did not contain the line $scan.subtype($q.subtype);, and later on some unit tests started to fail, because regex matches can be case insensitive, but the index op works only case sensitive. For case insensitive matches, the $q.subtype of the literal regex node would be ignorecase, so that information needs to be carried on to the code generation backend.

Once that part was in place, and some debug nqp::say() statements confirmed that it indeed worked, it was time to look at the code generation. For the parrot backend, it looked like this:

    method scan($node) {
        my $ops := self.post_new('Ops', :result(%*REG<cur>));
        my $prefix := self.unique('rxscan');
        my $looplabel := self.post_new('Label', :name($prefix ~ '_loop'));
        my $scanlabel := self.post_new('Label', :name($prefix ~ '_scan'));
        my $donelabel := self.post_new('Label', :name($prefix ~ '_done'));
        $ops.push_pirop('repr_get_attr_int', '$I11', 'self', %*REG<curclass>, '"$!from"');
        $ops.push_pirop('ne', '$I11', -1, $donelabel);
        $ops.push_pirop('goto', $scanlabel);
        $ops.push($looplabel);
        $ops.push_pirop('inc', %*REG<pos>);
        $ops.push_pirop('gt', %*REG<pos>, %*REG<eos>, %*REG<fail>);
        $ops.push_pirop('repr_bind_attr_int', %*REG<cur>, %*REG<curclass>, '"$!from"', %*REG<pos>);
        $ops.push($scanlabel);
        self.regex_mark($ops, $looplabel, %*REG<pos>, 0);
        $ops.push($donelabel);
        $ops;
    }

While a bit intimidating at first, staring at it for a while quickly made clear what kind of code it emits. First three labels are generated, to which the code can jump with goto $label: One as a jump target for the loop that increments the cursor position ($looplabel), one for doing the regex match at that position ($scanlabel), and $donelabel for jumping to when the whole thing has finished.

Inside the loop there is an increment (inc) of the register that holds the current position (%*REG<pos>), that position is compared to the end-of-string position (%*REG<eos>), and if it is larger, the cursor is marked as failed.

So the idea is to advance the position by one, and then instead of doing the regex match immediately, call the index op to find the next position where the regex might succeed:

--- a/src/vm/parrot/QAST/Compiler.nqp
+++ b/src/vm/parrot/QAST/Compiler.nqp
@@ -1564,7 +1564,13 @@ class QAST::Compiler is HLL::Compiler {
         $ops.push_pirop('goto', $scanlabel);
         $ops.push($looplabel);
         $ops.push_pirop('inc', %*REG<pos>);
-        $ops.push_pirop('gt', %*REG<pos>, %*REG<eos>, %*REG<fail>);
+        if nqp::elems($node.list) && $node.subtype ne 'ignorecase' {
+            $ops.push_pirop('index', %*REG<pos>, %*REG<tgt>, self.rxescape($node[0]), %*REG<pos>);
+            $ops.push_pirop('eq', %*REG<pos>, -1, %*REG<fail>);
+        }
+        else {
+            $ops.push_pirop('gt', %*REG<pos>, %*REG<eos>, %*REG<fail>);
+        }
         $ops.push_pirop('repr_bind_attr_int', %*REG<cur>, %*REG<curclass>, '"$!from"', %*REG<pos>);
         $ops.push($scanlabel);
         self.regex_mark($ops, $looplabel, %*REG<pos>, 0);

The index op returns -1 on failure, so the condition for a cursor fail is slightly different than before.

And as mentioned earlier, the optimization can only be safely done for matches that don't ignore case. Maybe with some additional effort that could be remedied, but it's not as simple as case-folding the target string, because some case folding operations can change the string length (for example ß becomes SS when uppercased).
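
A quick illustration (my own):

say 'ß'.uc;          # SS
say 'ß'.chars;       # 1
say 'ß'.uc.chars;    # 2 -- uppercasing changed the string length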

After successfully testing the patch, I came up with a small, artificial benchmark designed to show a difference in performance for this particular case. And indeed, it sped it up from 647 ± 28 µs to 161 ± 18 µs, which is roughly a factor of four.

You can see the whole thing as two commits on github.

What remains to do is implementing the same optimization on the JVM and MoarVM backends, and of course other optimizations. For example the Perl 5 regex engine keeps track of minimal and maximal string lengths for each subregex, and can anchor a regex like /a?b?longliteral/ to 0..2 characters before a match of longliteral, and generally use that meta information to fail faster.

But for now I am mostly encouraged that doing a worthwhile optimization was possible in a single evening without any black magic, or too intimate knowledge of the code generation.

Update: the code generation for MoarVM now also uses the index op. The logic is the same as for the parrot backend, the only difference is that the literal needs to be loaded into a register (whose name fresh_s returns) before index_s can use it.

Perlgeek.de : Quo Vadis Perl?

The last two days we had a gathering in town named Perl (yes, a place with that name exists). It's a lovely little town next to the borders to France and Luxembourg, and our meeting was titled "Perl Reunification Summit".

Sadly I only managed to arrive in Perl on Friday late in the night, so I missed the first day. Still it was totally worth it.

We tried to answer the question of how to make the Perl 5 and the Perl 6 community converge on a social level. While we haven't found the one true answer to that, we did find that discussing the future together, both on a technical and on a social level, already brought us closer together.

It was quite a touching moment when Merijn "Tux" Brand explained that he had been skeptical of Perl 6 before the summit, and now sees it as the future.

We also concluded that copying API design is a good way to converge on a technical level. For example, Perl 6's IO subsystem is in desperate need of a cohesive design. However, neither the Perl 6 specification team nor the Rakudo development team has much experience in that area, and copying from successful Perl 5 modules is a viable approach here. Path::Class and IO::All (excluding the crazy parts) were mentioned as targets worth looking at.

There is now also an IRC channel to continue our discussions -- join #p6p5 on irc.perl.org if you are interested.

We also discussed ways to bring parallel programming to both Perls. I missed most of the discussion, but did hear that one approach is to make it easier to send serialized objects to other processes, and thus distribute work among several cores.

Patrick Michaud gave a short ad-hoc presentation on implicit parallelism in Perl 6. There are several constructs where the language allows parallel execution, for example hyper operators, junctions and feeds (think of feeds as UNIX pipes, but ones that pass along objects and not just strings). Rakudo doesn't implement any of them in parallel right now, because the Parrot Virtual Machine does not provide the necessary primitives yet.
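
For readers unfamiliar with these constructs, here is a small sketch of my own (all of it running serially in current Rakudo, as explained above):

my @a = 1, 2, 3;
say @a >>+>> 10;                       # hyper operator: (11 12 13), element-wise and parallelizable in principle
say so any(@a) > 2;                    # junction: True; the comparisons could run in parallel
@a ==> map(* ** 2) ==> my @squares;    # feed: pipe-like, passing objects from stage to stage
say @squares;                          # [1 4 9]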

Besides the "official" program, everybody used the time in meat space to discuss their favorite projects with everybody else. For example I took some time to discuss the future of doc.perl6.org with Patrick and Gabor Szabgab, and the relation to perl6maven with the latter. The Rakudo team (which was nearly completely present) also discussed several topics, and I was happy to talk about the relation between Rakudo and Parrot with Reini Urban.

Prior to the summit my expectations were quite vague. That's why it's hard for me to tell if we achieved what we and the organizers wanted. Time will tell, and we want to summarize the result in six to nine months. But I am certain that many participants have changed some of their views in positive ways, and left the summit with a warm, fuzzy feeling.

I am very grateful to have been invited to such a meeting, and enjoyed it greatly. Our host and organizers, Liz and Wendy, took care of all of our needs -- travel, food, drinks, space, wifi, accommodation, more food, entertainment, food for thought, you name it. Thank you very much!

Update: Follow the #p6p5 hash tag on twitter if you want to read more, I'm sure other participants will blog too.

Other blog posts on this topic: PRS2012 – Perl5-Perl6 Reunification Summit by mdk and post-yapc by theorbtwo

Dave's Free Press: Journal: Wikipedia handheld proxy

Dave's Free Press: Journal: Bryar security hole

Dave's Free Press: Journal: POD includes

Dave's Free Press: Journal: cgit syntax highlighting

Perlgeek.de : First day at YAPC::Europe 2013 in Kiev

Today was the first "real" day of YAPC Europe 2013 in Kiev. In the same sense that it was the first real day, we had quite a nice "unreal" conference day yesterday, with a day-long Perl 6 hackathon, and in the evening a pre-conference meeting a Sovjet-style restaurant with tasty food and beverages.

The talks started with a few words of welcome, and then the announcement that the YAPC Europe next year will be in Sofia, Bulgaria, with the small side note that there were actually three cities competing for that honour. Congratulations to Sofia!

Larry's traditional keynote was quite emotional, and he had to fight tears a few times. Having had cancer and related surgeries in the past year, he still does his perceived duty to the Perl community, which I greatly appreciate.

Afterwards Dave Cross talked about 25 years of Perl in 25 minutes, which was a nice walk through some significant developments in the Perl world, though a bit hasty. Maybe picking fewer events and spending a bit more time on the selected few would give a smoother experience.

Another excellent talk that ran out of time was on Redis. Having experimented a wee bit with Redis in the past month, this was a real eye-opener on the wealth of features we might have used for a project at work, but in the end we didn't. Maybe we will eventually revise that decision.

Ribasushi talked about how hard benchmarking really is, and while I was (in principle) aware of that fact that it's hard to get right, there were still several significant factors that I overlooked (like the CPU's tendency to scale frequency in response to thermal and power-management considerations). I also learned that I should use Dumbbench instead of the Benchmark.pm core module. Sadly it didn't install for me (Capture::Tiny tests failing on Mac OS X).

The Perl 6 is dead, long live Perl 5 talk was much less inflammatory than the title would suggest (maybe due to Larry touching on the subject briefly during the keynote). It was mostly about how Perl 5 is used in the presenter's company, which was mildly interesting.

After a tasty free lunch I attended jnthn's talk on Rakudo on the JVM, which was (as is typical for jnthn's talks) both entertaining and taught me something, even though I had followed the project quite a bit.

Thomas Klausner's Bread::Board by example made me want to refactor the OTRS internals very badly, because it is full of the anti-patterns that Bread::Board can solve in a much better way. I think that the OTRS code base is big enough to warrant the usage of Bread::Board.

I enjoyed Denis' talk on Method::Signatures, and was delighted to see that most syntax is directly copied from Perl 6 signature syntax. Talk about Perl 6 sucking creativity out of Perl 5 development.

The conference ended with a session of lightning talks, something which I always enjoy. Many lightning talks had a slightly funny tone or undertone, while still talking about interesting stuff.

Finally there was the "kick-off party", beverages and snacks sponsored by booking.com. There (and really the whole day, and yesterday too) I not only had conversations with my "old" Perl 6 friends, but also talked with many interesting people I never met before, or only met online before.

So all in all it was a nice experience, both from the social side, and from quality and contents of the talks. Venue and food are good, and the wifi too, except when it stops working for a few minutes.

I'm looking forward to two more days of conference!

(Updated: Fixed Thomas' last name)

Ocean of Awareness: What makes a parsing algorithm successful?

What makes a parsing algorithm successful? Two factors, I think. First, does the algorithm parse a workably-defined set of grammars in linear time? Second, does it allow the application to intervene in the parse with custom code? When parsing algorithms are compared, typically neither of these gets much attention. But the successful algorithms do one or the other.

Does the algorithm parse a workably-defined set of grammars in linear time?

"Workably-defined" means more than well-defined in the mathematical sense -- the definition has to be workable. That is, the definition must be something that, with reasonable effort, a programmer can use in practice.

The algorithms in regular expression engines are workably-defined. A regular expression, in the pure sense, consists of a sequence of symbols, usually shown by concatenation:

a b c

or a choice among sequences, usually shown by a vertical bar:

a | b | c

or a repetition of any of the above, typically shown with a star:

a*

or any recursive combination of these. True, if this definition is new to you, it can take time to get used to. But vast numbers of working programmers are very much "used to it", can think in terms of regular expressions, and can determine if a particular problem will yield to treatment as a regular expression, or not.
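
Written as a Perl 6 regex, for instance (my own sketch -- any regex dialect would do), the three building blocks look like this:

say so 'abc'  ~~ / a b c /;        # concatenation
say so 'b'    ~~ / a | b | c /;    # alternation
say so 'aaaa' ~~ / ^ a* $ /;       # repetition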

Parsers in the LALR family (yacc, bison, etc.) do not have a workably defined set of grammars. LALR is perfectly well-defined mathematically, but even experts in parsing theory are hard put to decide if a particular grammar is LALR.

Recursive descent also does not have a workably defined set of grammars. Recursive descent doesn't even have a precise mathematical description -- you can say that recursive descent is LL, but in practice LL tables are rarely used. Also in practice, the LL logic is extended with every other trick imaginable, up to and including switching to other parsing algorithms.

Does it allow the user to intervene in the parse?

It is not easy for users to intervene in the processing of a regular expression, though some implementations attempt to allow such efforts. LALR parsers are notoriously opaque. Those who maintain the LALR-driven Perl parser have tried to supplement its abilities with custom code, with results that will not encourage others making the same attempt.

Recursive descent, on the other hand, has no parse engine -- it is 100% custom code. You don't get much friendlier than that.

Conclusions

Regular expressions are a success, and will remain so, because the set of grammars they handle is very workably-defined. Applications using regular expressions have to take what the algorithm gives them, but what it gives them is very predictable.

For example, an application can write regular expressions on the fly, and the programmer can be confident they will run as long as they are well-formed. And it is easy to determine if the regular expression is well-formed. (Whether it actually does what you want is a separate issue.)

Recursive descent does not handle a workably-defined set of grammars, and it also has to be hand-written. But it makes up for this by allowing the user to step into the parsing process anywhere, and "get his hands dirty". Recursive descent does nothing for you, but it does allow you complete control. This is enough to make recursive descent the current algorithm of choice for major parsing projects.

As I have chronicled elsewhere, LALR was once, on highly convincing theoretical grounds, seen as the solution to the parsing problem. But while mathematically well-defined, LALR was not workably defined. And it was very hostile to applications that tried to alter, or even examine, its syntax-driven workings. After decades of trying to make it work, the profession has abandoned LALR almost totally.

What about Marpa?

Marpa has both properties: its set of grammars is workably-defined. And, while Marpa is syntax-driven like LALR and regular expressions, it also allows the user to stop the parse engine, communicate with it about the state of the parse, do her own parsing for a while, and restart the parse engine at any point she wants.

Marpa's workable definition has a nuance that the one for regular expressions does not. For regular expressions, linearity is a given -- they parse in linear time or fail. Marpa parses a much larger class of grammars, the context-free grammars -- anything that can be written in BNF. BNF is used to describe languages in standards, and is therefore itself a kind of "gold standard" for a workable definition of a set of grammars. However, Marpa does not parse everything that can be written in BNF in linear time.

Marpa's linearly-parsed set of grammars is smaller than the context-free grammars, but it is still very large, and it is still workably-defined. Marpa will parse any unambiguous language in linear time, unless it contains unmarked middle recursions. An example of a "marked" middle recursion is the language described by

S ::= a S a | x

a string of which is "aaaxaaa", where the "x" marks the middle. An example of an "unmarked" middle recursion is the language described by

S ::= a S a | a

a string of which is "aaaaaaa", where nothing marks the middle, so that you don't know until the end where the middle of the recursion is. If a human can reliably find the middle by eyeball, the middle recursion is marked. If a human can't, then the middle recursion might be unmarked.

Marpa also parses a large set of ambiguous grammars linearly, and this set of grammars is also workably-defined. Marpa parses an ambiguous grammar in linear time if

  • It has no unmarked middle recursions.
  • All right recursions are unambiguous.
  • There are no cycles. A cycle occurs, for example, if there is a rule A ::= A in the grammar.
  • Marpa's level of ambiguity at any location is bounded by a constant.

The term "level of ambiguity" requires a bit of explanation. At any given location, there can be as many rules "in play" as you like, without affecting the level of ambiguity. The key question: What is the maximum number of different origins that a rule might have? (The "origin" of a rule is the location where it begins.) That is, can a rule currently in play have at most 20 different origins? Or could it have its origin at every location so far? If the maximum number of origins is 20 or any other fixed constant, the level of ambiguity is "bounded". But if the maximum number of origins keeps growing as the length of the input grows, the level of ambiguity is unbounded.

For the unambiguous case, Marpa's workable definition encompasses a much larger class of grammars, but is no more complex than that for regular expressions. If you want to extend even further, and work with ambiguous grammars, the definition remains quite workable. Of the four restrictions needed to ensure linearity, the one requiring a bounded level of ambiguity is the only one that might force you to exercise real vigilance -- once you get into ambiguity, unboundedness is easy to slip into.

As for the other three, cycles never occur in practical grammars, and Marpa reports them, so that you simply fix them when they happen. Most recursions will be left recursions, which are unrestricted. My experience has been that, in practical grammars, unmarked middle recursions and ambiguous right recursions are not especially tempting features. If you note whenever you use a right recursion, checking that it is not ambiguous, and if you note whenever you use a middle recursion, checking that it is marked, then you will stay linear.

To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site.

Comments

Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: CPAN Testers' CPAN author FAQ

Ocean of Awareness: Parsing: a timeline

[ Revised 22 Oct 2014 ]

1960: The ALGOL 60 spec comes out. It specifies, for the first time, a block structured language. The ALGOL committee is well aware that nobody knows how to parse such a language. But they believe that, if they specify a block-structured language, a parser for it will be invented. Risky as this approach is, it pays off ...

1961: Ned Irons publishes his ALGOL parser. In fact, the Irons parser is the first parser of any kind to be described in print. Ned's algorithm is a left parser -- a form of recursive descent. Unlike modern recursive descent, the Irons algorithm is general and syntax-driven. "General" means it can parse anything written in BNF. "Syntax-driven" (aka declarative) means that parser is actually created from the BNF -- the parser does not need to be hand-written.

1961: Almost simultaneously, hand-coded approaches to left parsing appear. These we would now recognize as recursive descent. Over the following years, hand-coding approaches will become more popular for left parsers than syntax-driven algorithms. Three factors are at work:

  • In the 1960's, memory and CPU are both extremely limited. Hand-coding pays off, even when the gains are small.
  • Pure left parsing is a very weak parsing technique. Hand-coding is often necessary to overcome its limits. This is as true today as it was in 1961.
  • Left parsing works well in combination with hand-coding -- they are a very good fit.

1965: Don Knuth invents LR parsing. Knuth is primarily interested in the mathematics. Knuth describes a parsing algorithm, but it is not thought practical.

1968: Jay Earley invents the algorithm named after him. Like the Irons algorithm, Earley's algorithm is syntax-driven and fully general. Unlike the Irons algorithm, it does not backtrack. Earley's core idea is to track everything about the parse in tables. Earley's algorithm is enticing, but it has three major issues:

  • First, there is a bug in the handling of zero-length rules.
  • Second, it is quadratic for right recursions.
  • Third, the bookkeeping required to set up the tables is, by the standards of 1968 hardware, daunting.

1969: Frank DeRemer describes a new variant of Knuth's LR parsing. DeRemer's LALR algorithm requires only a stack and a state table of quite manageable size.

1972: Aho and Ullman describe a straightforward fix to the zero-length rule bug in Earley's original algorithm. Unfortunately, this fix involves adding even more bookkeeping to Earley's.

1975: Bell Labs converts its C compiler from hand-written recursive descent to DeRemer's LALR algorithm.

1977: The first "Dragon book" comes out. This soon-to-be classic textbook is nicknamed after the drawing on the front cover, in which a knight takes on a dragon. Emblazoned on the knight's lance are the letters "LALR". From here on out, to speak lightly of LALR will be to besmirch the escutcheon of parsing theory.

1979: Bell Laboratories releases Version 7 UNIX. V7 includes what is, by far, the most comprehensive, useable and easily available compiler writing toolkit yet developed. Central to the toolkit is yacc, an LALR based parser generator. With a bit of hackery, yacc parses its own input language, as well as the language of V7's main compiler, the portable C compiler. After two decades of research, it seems that the parsing problem is solved.

1987: Larry Wall introduces Perl 1. Perl embraces complexity like no previous language. Larry uses LALR very aggressively -- to my knowledge more aggressively than anyone before or since.

1991: Joop Leo discovers a way of speeding up right recursions in Earley's algorithm. Leo's algorithm is linear for just about every unambiguous grammar of practical interest, and many ambiguous ones as well. In 1991 hardware is six orders of magnitude faster than 1968 hardware, so that the issue of bookkeeping overhead had receded in importance. This is a major discovery. When it comes to speed, the game has changed in favor of Earley algorithm. But Earley parsing is almost forgotten. It will be 20 years before anyone writes a practical implementation of Leo's algorithm.

1990's: Earley's is forgotten. So everyone in LALR-land is content, right? Wrong. Far from it, in fact. Users of LALR are making unpleasant discoveries. While LALR automatically generates their parsers, debugging them is so hard they could just as easily write the parser by hand. Once debugged, their LALR parsers are fast for correct inputs. But almost all they tell the users about incorrect inputs is that they are incorrect. In Larry's words, LALR is "fast but stupid".

2000: Larry Wall decides on a radical reimplementation of Perl -- Perl 6. Larry does not even consider using LALR again.

2002: Aycock&Horspool publish their attempt at a fast, practical Earley's parser. Missing from it is Joop Leo's improvement -- they seem not to be aware of it. Their own speedup is limited in what it achieves and the complications it introduces can be counter-productive at evaluation time. But buried in their paper is a solution to the zero-length rule bug. And this time the solution requires no additional bookkeeping.

2006: GNU announces that the GCC compiler's parser has been rewritten. For three decades, the industry's flagship C compilers have used LALR as their parser -- proof of the claim that LALR and serious parsing are equivalent. Now, GNU replaces LALR with the technology that it replaced a quarter century earlier: recursive descent.

2000 to today: With the retreat from LALR comes a collapse in the prestige of parsing theory. After a half century, we seem to be back where we started. If you took Ned Irons' original 1961 algorithm, changed the names and dates, and translated the code from the mix of assembler and ALGOL into Haskell, you could easily republish it today, and bill it as revolutionary and new.

Marpa

Over the years, I had come back to Earley's algorithm again and again. Around 2010, I realized that the original, long-abandoned vision -- an efficient, practical, general and syntax-driven parser -- was now, in fact, quite possible. The necessary pieces had fallen into place.

Aycock & Horspool had solved the zero-length rule bug. Joop Leo had found the speedup for right recursion. And the issue of bookkeeping overhead had pretty much evaporated on its own. Machine operations are now a billion times faster than in 1968, and probably no longer relevant in any case -- cache misses are now the bottleneck.

But while the original issues with Earley's disappeared, a new issue emerged. With a parsing algorithm as powerful as Earley's behind it, a syntax-driven approach can do much more than it can with a left parser. But with the LALR experience in their collective consciousness, few modern programmers are prepared to trust a purely declarative parser. As Mark Twain observed, "Once a cat's been burned, he won't even sit on a cold stove."

To be accepted, Marpa needed to allow procedural parsing, not just declarative parsing. So Marpa allows the user to specify events -- occurrences of symbols and rules -- at which declarative parsing pauses. While paused, the application can call procedural logic and single-step forward token by token. The procedural logic can hand control back over to syntax-driven parsing at any point it likes. The Earley tables can provide the procedural logic with full knowledge of the state of the parse so far: all rules recognized in all possible parses so far, and all symbols expected. Earley's algorithm is now an even better companion for hand-written procedural logic than recursive descent.
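
To make that concrete, here is a minimal sketch of the pause-and-single-step pattern, written against Marpa::R2's SLIF ("Scanless") interface. The grammar, the lexeme, and the event name are invented for this illustration; treat it as a sketch of the control flow, not as production code.

#!/usr/bin/perl
# Sketch: declarative parsing pauses at a named event, the application
# single-steps over the token itself, then hands control back.
# Assumes Marpa::R2's SLIF interface; grammar and names are invented.
use strict;
use warnings;
use Marpa::R2;

my $dsl = <<'END_OF_DSL';
:default ::= action => ::array
lexeme default = latm => 1

numbers ::= number*

# Pause declarative parsing just before each <number> lexeme
# and fire the 'before_number' event.
:lexeme ~ number pause => before event => 'before_number'

number     ~ [\d]+
:discard   ~ whitespace
whitespace ~ [\s]+
END_OF_DSL

my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );
my $recce   = Marpa::R2::Scanless::R->new( { grammar => $grammar } );

my $input = '1 22 333';
my $pos   = $recce->read( \$input );

while ( $pos < length $input ) {
    for my $event ( @{ $recce->events() } ) {
        my ($name) = @{$event};
        next if $name ne 'before_number';

        # Parsing is paused.  Procedural logic runs here: it can inspect
        # the parse so far (for example via terminals_expected()) and
        # single-step forward by reading the token itself.
        my ( $start, $length ) = $recce->pause_span();
        my $literal = substr $input, $start, $length;
        print "Paused before number '$literal' at position $start\n";
        $recce->lexeme_read( 'number', $start, $length, $literal );
    }
    # Hand control back to syntax-driven parsing.
    $pos = $recce->resume();
}

Real applications pause on more interesting symbols and do more while paused, but the control flow is the same: read, pause, run procedural logic, resume.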

For more

For more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site. Comments on this post can be made in Marpa's Google group.

Perlgeek.de : Correctness in Computer Programs and Mathematical Proofs

While reading On Proof and Progress in Mathematics by Fields Medal winner Bill Thurston (recently deceased, I was sorry to hear), I came across this gem:

The standard of correctness and completeness necessary to get a computer program to work at all is a couple of orders of magnitude higher than the mathematical community’s standard of valid proofs. Nonetheless, large computer programs, even when they have been very carefully written and very carefully tested, always seem to have bugs.

I noticed that mathematicians are often sloppy about the scope of their symbols. Sometimes they use the same symbol for two different meanings, and you have to guess from context which one is meant.

This kind of sloppiness generally doesn't have an impact on the validity of the ideas that are communicated, as long as it's still understandable to the reader.

I guess one reason is that most mathematical publications still stick to one-letter symbol names, and there aren't that many letters in the alphabets that are generally accepted for use (Latin, Greek, a few letters from Hebrew). And in the programming world we snort derisively at FORTRAN 77, which limited variable names to six characters.

Dave's Free Press: Journal: Thankyou, Anonymous Benefactor!

Ocean of Awareness: Reporting mismatched delimiters

In many contexts, programs need to identify non-overlapping pieces of a text. One very direct way to do this is to use a pair of delimiters. One delimiter of the pair marks the start and the other marks the end. Delimiters can take many forms: Quote marks, parentheses, curly braces, square brackets, XML tags, and HTML tags are all delimiters in this sense.

Mismatching delimiters is easy to do. Traditional parsers are often poor at reporting these errors: hopeless after the first mismatch, and for that matter none too precise about the first one. This post outlines a scaleable method for the accurate reporting of mismatched delimiters. I will illustrate the method with a simple but useable tool -- a utility which reports mismatched brackets.

The example script

The example script, bracket.pl, reports mismatched brackets in the set:

() {} []

They are expected to nest without overlaps. Other text is treated as filler. bracket.pl is not smart about things like strings or comments. This does have the advantage of making bracket.pl mostly language-agnostic.

Because it's intended primarily to be read as an illustration of the technique, bracket.pl's grammar is a basic one. The grammar that bracket.pl uses is so simple that an emulator of bracket.pl could be written using recursive descent. I hope the reader who goes on to look into the details will see that this technique scales to more complex situations, in a way that a solution based on a traditional parser will not.
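
To give a feel for what such an emulator would report, here is a toy, non-Marpa sketch. It uses an explicit stack rather than recursive descent, which for this flat bracket language comes to the same thing, and it only mimics the two kinds of diagnostics ("missing close" and "missing open") in the crudest way -- no line numbers, no caret markers, none of bracket.pl's repair machinery.

#!/usr/bin/perl
# Toy, non-Marpa emulation of the kind of report bracket.pl produces.
use strict;
use warnings;

my %closer_for = ( '(' => ')', '{' => '}', '[' => ']' );
my %is_close   = map { $_ => 1 } values %closer_for;

sub check_brackets {
    my ($text) = @_;
    my @open_stack;    # each entry: [ bracket character, column ]
    my @problems;
    for my $col ( 1 .. length $text ) {
        my $char = substr $text, $col - 1, 1;
        if ( exists $closer_for{$char} ) {
            push @open_stack, [ $char, $col ];
            next;
        }
        next if not $is_close{$char};    # other text is filler
        if ( @open_stack and $closer_for{ $open_stack[-1][0] } eq $char ) {
            pop @open_stack;             # properly matched pair
            next;
        }
        if (@open_stack) {
            # Assume the innermost open bracket is missing its close,
            # then re-examine the current close bracket.
            my ( $open_char, $open_col ) = @{ pop @open_stack };
            push @problems,
              "Column $open_col: Missing close $closer_for{$open_char}, problem detected at column $col";
            redo;
        }
        push @problems, "Column $col: Missing open for '$char'";
    }
    for my $unclosed (@open_stack) {
        my ( $open_char, $open_col ) = @{$unclosed};
        push @problems,
          "Column $open_col: Opening '$open_char' never closed, problem detected at end of string";
    }
    return \@problems;
}

print "$_\n" for @{ check_brackets('((([))') };

For '((([))' this toy reports the same two problems that bracket.pl reports below: a missing close ']' for the bracket at column 4, and a '(' at column 1 that is never closed.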

Error reports

The description of how the method works will make more sense after we've looked at some examples of the diagnostics bracket.pl produces. To be truly useful, bracket.pl must report mismatches that span many lines, and it can do this. But single-line examples are easier to follow, so all the examples in this post are contained in a single line. Consider the string '((([))'. bracket.pl's diagnostics are:

* Line 1, column 1: Opening '(' never closed, problem detected at end of string
((([))
^
====================
* Line 1, column 4: Missing close ], problem detected at line 1, column 5
((([))
   ^^

In the next example, bracket.pl realizes that it cannot accept the ')' at column 16 without first closing the curly brace opened at column 5. It identifies the problem, along with both of the locations involved.

* Line 1, column 5: Missing close }, problem detected at line 1, column 16
[({({x[]x{}x()x)})]
    ^          ^

So far, so good. But an important advantage of bracket.pl has yet to be seen. Most compilers, once they report a first mismatched delimiter, produce error messages that are unreliable -- so unreliable that they are useless in practice. bracket.pl repairs a mismatched bracket before continuing, so that it can do a reasonable job of analyzing the text that follows. Consider the text '({]-[(}-[{)'. The output of bracket.pl is

* Line 1, column 1: Missing close ), problem detected at line 1, column 3
({]-[(}-[{)
^ ^
====================
* Line 1, column 2: Missing close }, problem detected at line 1, column 3
({]-[(}-[{)
 ^^
====================
* Line 1, column 3: Missing open [
({]-[(}-[{)
  ^
====================
* Line 1, column 5: Missing close ], problem detected at line 1, column 7
({]-[(}-[{)
    ^ ^
====================
* Line 1, column 6: Missing close ), problem detected at line 1, column 7
({]-[(}-[{)
     ^^
====================
* Line 1, column 7: Missing open {
({]-[(}-[{)
      ^
====================
* Line 1, column 9: Missing close ], problem detected at line 1, column 11
({]-[(}-[{)
        ^ ^
====================
* Line 1, column 10: Missing close }, problem detected at line 1, column 11
({]-[(}-[{)
         ^^
====================
* Line 1, column 11: Missing open (
({]-[(}-[{)
          ^

Each time, bracket.pl corrects itself, and accurately reports the next set of problems.

A difficult error report

To be 100% accurate, bracket.pl would have to guess the programmer's intent. This is, of course, not possible. Let's look at a text where bracket.pl's guesses are not so good: {{]}. Here we will assume the closing square bracket is a typo for a closing curly brace. Here's the result:

* Line 1, column 1: Missing close }, problem detected at line 1, column 3
{{]}
^ ^
====================
* Line 1, column 2: Missing close }, problem detected at line 1, column 3
{{]}
 ^^
====================
* Line 1, column 3: Missing open [
{{]}
  ^
====================
* Line 1, column 4: Missing open {
{{]}
   ^

Instead of one error, bracket.pl finds four.

But even in this case, the method is fairly good, especially when compared with current practice. The problem is at line 1, column 3, and the first three messages all identify this as one of their potential problem locations. It is reasonable to believe that a programmer, especially once he becomes used to this kind of mismatch reporting, will quickly find the first mismatch and fix it. For this difficult case, bracket.pl may not be much better than the state of the art, but it is certainly no worse.

How it works

For full details of the workings of bracket.pl there is the code, which is heavily commented. This section provides a conceptual overview.

bracket.pl uses two features of Marpa: left-eideticism and the Ruby Slippers. By left-eidetic, I mean that Marpa knows everything there is to know about the parse at, and to the left of, the current position. As a consequence, Marpa also knows exactly which of its input symbols can lead to a successful parse, and is able to stop as soon as it knows that the parse cannot succeed.

In the Ruby Slippers technique, we arrange for parsing to stop whenever we encounter an input which would cause parsing to fail. The application then asks Marpa, "OK. What input would allow the parse to continue?" The application takes Marpa's answer to this question, and uses it to concoct an input that Marpa will accept.

In this case, bracket.pl creates a virtual token which fixes the mismatch of brackets. Whatever the missing bracket may be, bracket.pl invents a bracket of that kind, and adds it to the virtual input. This done, parsing and error detection can proceed as if there was no problem. Of course, the error which made the Ruby Slippers token necessary is recorded, and those records are the source of the error reports we saw above.
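
Here is a minimal sketch of that loop, again assuming Marpa::R2's SLIF interface. The grammar and lexeme names are invented, only two bracket types are handled, the "missing open" case is not handled, and the zero-length span given to the virtual lexeme is an assumption made for this illustration -- bracket.pl itself is more general and more careful.

#!/usr/bin/perl
# Sketch of the Ruby Slippers move: when the actual close bracket is
# unacceptable, ask Marpa what it would accept, record the mismatch,
# and feed a virtual close bracket so parsing can continue.
# Assumes Marpa::R2's SLIF interface; details are illustrative only.
use strict;
use warnings;
use Marpa::R2;

my $dsl = <<'END_OF_DSL';
lexeme default = latm => 1

text    ::= piece*
piece   ::= parens | curlies | filler
parens  ::= lparen text rparen
curlies ::= lcurly text rcurly

# Pause just before any close bracket, so the application can ask
# Marpa whether that bracket is acceptable here.
:lexeme ~ rparen pause => before event => 'close'
:lexeme ~ rcurly pause => before event => 'close'

lparen ~ '('
rparen ~ ')'
lcurly ~ '{'
rcurly ~ '}'
filler ~ [^(){}]+
END_OF_DSL

my %virtual_text = ( rparen => ')', rcurly => '}' );

my $input   = '({)';    # the '}' is missing
my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );
my $recce   = Marpa::R2::Scanless::R->new( { grammar => $grammar } );

my $pos = $recce->read( \$input );
while ( $pos < length $input ) {
    my ( $start, $length ) = $recce->pause_span();
    my $actual   = $recce->pause_lexeme();
    my %expected = map { $_ => 1 } @{ $recce->terminals_expected() };
    while ( not $expected{$actual} ) {
        my ($wanted) = grep { exists $virtual_text{$_} } keys %expected;
        die "Missing open bracket -- not handled in this sketch\n"
            if not defined $wanted;
        my $column = $start + 1;
        warn "Missing close $virtual_text{$wanted}, problem detected at column $column\n";
        # Assumption: a zero-length physical span is acceptable for a
        # virtual lexeme fed via lexeme_read().
        $recce->lexeme_read( $wanted, $start, 0, $virtual_text{$wanted} );
        %expected = map { $_ => 1 } @{ $recce->terminals_expected() };
    }
    $recce->lexeme_read( $actual, $start, $length,
        substr( $input, $start, $length ) );
    $pos = $recce->resume();
}

For the input '({)', this sketch warns about the missing '}' and then accepts the ')' that follows -- the repair-and-continue behaviour seen in the error reports above.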

To make its error messages as informative as possible in the case of missing closing brackets, bracket.pl needs to report the exact location of the opening bracket. Left-eideticism again comes in handy here. Once the virtual closing bracket is supplied to Marpa, bracket.pl asks, "That bracketed text that I just closed -- where did it begin?" The Marpa parser tracks the start location of all symbol and rule instances, so it is able to provide the application with the exact location of the starting bracket.

When bracket.pl encounters a problem at a point where there are unclosed opening brackets, it has two choices. It can be optimistic or it can be pessimistic. "Optimistic" means it can hope that something later in the input will close the opening bracket. "Pessimistic" means it can decide that "all bets are off" and use Ruby Slippers tokens to close all the currently active open brackets.

bracket.pl uses the pessimistic strategy. While the optimistic strategy sounds better, in practice the pessimistic one seems to provide better diagnostics. The pessimistic strategy does report some fixable problems as errors. But the optimistic one can introduce spurious fixes. These hide the real errors, and it is worse to miss errors than it is to overreport them. Even when the pessimistic strategy overreports, its first error message will always accurately identify the first problem location.

While bracket.pl is already useable, I think of it as a prototype. Beyond that, the problem of matching delimiters is in fact very general, and I believe these techniques may have very wide application.

For more

The example script of this post is a Github gist. For more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site. Comments on this post can be made in Marpa's Google group.

Dave's Free Press: Journal: Number::Phone release

Dave's Free Press: Journal: Ill

Dave's Free Press: Journal: CPANdeps upgrade

Dave's Free Press: Journal: YAPC::Europe 2006 report: day 3

Perlgeek.de : iPod nano 5g on linux -- works!

For Christmas I got an iPod nano (5th generation). Since I use only Linux on my home computers, I searched the Internet for how well it is supported by Linux-based tools. The results looked bleak, but they were mostly from 2009.

Now (December 2012) on my Debian/Wheezy system, it just worked.

The iPod nano 5g presents itself as an ordinary USB storage device, which you can mount without problems. However, simply copying files onto it won't make the iPod show those files in its playlists, because there is some metadata stored on the device that must be updated too.

There are several user-space programs that allow you to import and export music from and to the iPod, and update those metadata files as necessary. The first one I tried, gtkpod 2.1.2, worked fine.

Other user-space programs reputed to work with the iPod are rhythmbox and amarok (both of which not only organize but also play music).

Although I don't think anything really depends on the particular versions here (except that you need a new enough version of gtkpod), here is what I used:

  • Architecture: amd64
  • Linux: 3.2.0-4-amd64 #1 SMP Debian 3.2.35-2
  • Userland: Debian GNU/Linux "Wheezy" (currently "testing")
  • gtkpod: 2.1.2-1