Joel Berger: Reflections on Test2

In a future post I will recount the details of my delightful experience at the 2016 Perl QA Hackathon. Since this is my first post since then, I do want to tip my hat to the event's generous sponsors and to my employer, ServerCentral, without whom I would not have been able to attend. I will thank them in more detail in that post.

Before I get to that, however, I want to reflect on one discussion that has been weighing on my mind ever since: the upcoming release of Test2, which I consider a very important step forward for Perl's testing architecture.

While I didn't spend the majority of my time at the QA Hackathon thinking about Test2, it is probably the most important part of the event to discuss. Test2 is a major project to rewrite the guts of the venerable Perl testing library Test::Builder. Most users are blissfully unaware of Test::Builder; they use Test::More instead. But Test::More uses Test::Builder under the hood, as do almost all (let's say all) test modules.

Test::Builder draws its interface from the Test Anything Protocol (TAP). The protocol is seemingly very simple: print the number of tests you expect to run, then print 'ok 1' to say the first test passes, 'ok 2' for the second, and so on. I can report three passing tests by printing

1..3
ok 1
ok 2
ok 3

But most test authors don't print their test results by hand; they use a function like Test::More::ok to print them, and something like plan or done_testing to manage the number of tests they expect. This is because it is annoying to keep track of the current test number yourself, especially once you start sharing testing code.
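
For example, a minimal test script in the usual style lets Test::More do all of the counting and printing:

use Test::More;

ok(1 + 1 == 2, 'addition works');
is(uc('tap'), 'TAP', 'uc() upper-cases');

done_testing();   # emits the plan ("1..2") for you, based on the count kept internally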

Test::Builder, the guts of Test::More, keeps an internal record of the number of tests that have been run, so that you don't have to. While this seems useful, it sows the seeds of problems that we are only now starting to understand fully. Test::Builder's internal state is fundamental to the running of the test, which quickly leads to the realization that every test module has to play nicely with Test::Builder: otherwise the count skews, your tests don't report correctly, and your test runner sees a failure.
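
A contrived sketch of the failure mode: a helper that prints raw TAP itself instead of going through Test::Builder, so the shared counter never sees its test and the plan no longer matches what the harness counts.

use Test::More tests => 2;

sub naive_ok {
    # Bypasses Test::Builder entirely, so the shared counter never sees this test.
    print "ok - checked by hand\n";
}

ok(1, 'first real test');     # Test::Builder numbers this "ok 1"
naive_ok();                   # an extra, unnumbered "ok" line the plan doesn't cover
ok(1, 'second real test');    # numbered "ok 2", but the harness has now seen three tests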

Our problem now is that Test::Builder is showing its age. People want to write tests in different and more expressive ways. They want to use subtests (groups of tests that count together), they want to output formats other than TAP, they want to test their test modules, and they want their harness (runner) output to be pretty. Test::Builder can't really support these things well, but that doesn't mean people haven't gotten those features by hook or by crook.

My Involvement in Test2

Most readers will be aware that I am a core developer of Mojolicious and the author of many related CPAN modules. Test::Mojo::Role::Phantom (I'll refer to it as ::Phantom for brevity) is a module which combines Test::Mojo and the headless JavaScript interpreter PhantomJS to greatly simplify testing of a web application's dynamic content. Note that via Test::Mojo::Role::PSGI these test modules can also be used to test Catalyst, Dancer, or any other PSGI-compliant application.

When it starts, it spins up a Mojolicious service on a random local port (possibly shimmed to host a PSGI app). It then forks a PhantomJS interpreter; the child's STDOUT is a pipe pointed back at the parent. Via this pipe the JavaScript can send back JSON structures representing commands to be executed back in the parent. These commands are intended to run various test functions in the parent as they come in. You might say

perl('Test::More::is', some_selected_text, 'expected result', 'description')

which, back in the Perl process, is executed as

Test::More::is(@remaining_arguments);
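
For illustration only (this is not ::Phantom's actual code), the parent side of that pipe protocol might look roughly like this: read one JSON array per line and dispatch it to the named test function.

use strict;
use warnings;
use JSON::PP qw(decode_json);

sub listen_for_tests {
    my ($pipe) = @_;   # read end of the pipe connected to the phantom child's STDOUT
    while (my $line = <$pipe>) {
        my ($function, @args) = @{ decode_json($line) };
        no strict 'refs';
        $function->(@args);   # e.g. Test::More::is($got, $expected, $description)
    }
}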

This mechanism works very well. While not a replacement for more dedicated front-end testing tools, it helps authors who wouldn't otherwise write any front-end tests to get started. But ::Phantom has some problems that I cannot directly solve.

The test script itself might make only one call to $t->phantom_ok($url, $js), but the phantom interpreter might execute several tests. In this scenario the call to phantom_ok is logically equivalent to a subtest, so it wraps the fork and the subsequent listen/interpret cycle in a subtest. However, if a test fails, the failure is reported as emanating from inside phantom_ok. This problem alone wouldn't be tremendously difficult to fix; you could simply tell Test::Builder to report test failures from a higher stack frame with the usual trick of

local $Test::Builder::Level = $Test::Builder::Level + 1;

For ::Phantom, however, this will not help. The parent still has to serve pages to the client and listen for calls arriving on the child's handle, so it has to perform the tests inside an IO loop. Those of you familiar with IO loops can now see the problem: when the test is executed, the original caller's stack frame isn't one level above the current one. In fact the original caller isn't (really) in the call stack at all, and certainly isn't a predictable number of frames above it.

This means that Test::Builder's interface for this task is too limiting. Until Chad Granum started working on Test2, the next step would have been to hack Test::Builder in one of several nasty ways. In fact many, many test modules on CPAN have already done this, because they had no other choice. These hacks range from mild to horrendous and can even lead to test modules which conflict with each other in strange ways or break unexpectedly when Test::More is updated in a seemingly innocuous way.

Actually, that stretches the truth; Test::More is hardly ever updated, because in practice too much breaks when that is attempted. Test::More is essentially stuck, frozen by the hacks that people have perpetrated around it.

I am lucky, though, because Chad has been working on a replacement: Test2. Chad (and many others before him) noticed this problem of people hacking Test::More and wanted a solution. Test2 has been specially (and painstakingly) written to provide public API mechanisms that replace as many of the hacks seen on CPAN as possible. The only reason those hacks existed was that there was no approved way to do what test module authors needed.

In my case Test2 provides a mechanism to capture a context object from which test successes and failures are emitted. When you think about it, this is a much more natural way to tell the testing singleton what is desired. Rather than

sub my_cool_test_ok {
  ... # arbitrary depth of calls ...
  # count the number of intervening stack frames $n
  local $Test::Builder::Level = $Test::Builder::Level + $n;
  ok(...);
  ...
}

You can simply do

use Test2::API 'context';

sub my_cool_test_ok {
  my $ctx = context();
  ... # arbitrary depth of calls ...
  $ctx->ok(...);
  ...
}

and the tests will be reported correctly.

Test2 to the Rescue

My situation isn’t uncommon. I want to use Test::Builder in a way that wasn’t originally envisioned.

Test2 contains many such fixes for many abuse patterns, both common and uncommon. In fact, the reason it has taken so long is that over the YEARS this has been in development, Chad has continued to find modules that did stranger and stranger things to Test::Builder, and then found ways to support them! The Test::Builder interface doesn't change when Test2 replaces its guts. This includes things that were never intended to be public interface but have de facto become public API over the course of all this hacking.

There were a few modules on CPAN that could not be supported, but most of those have now been updated to use supportable APIs in Test::Builder. This leaves only a very small handful of the most unsupportable ones. Once Test::Builder-on-Test2 hits CPAN, nearly everyone will upgrade transparently thanks to this herculean effort. Not all of that praise goes to Chad, as many others have submitted reports, offered suggestions, and yes, even held him back at times, but a huge amount does. The final product is a stunning level of backwards support, worthy of the Perl language's commitment to that goal.

Test2 is More Than Just A Way Forward

Here you have seen my argument for Test2 as the successor to Test::Builder from the perspective of a more useful public API. It is important to note that this wasn't the original motivation.

Test2 brings one other huge benefit: an event system! When you run ok using Test2 you don't just print 'ok $n' anymore; you emit a passing-test event. For most users this event will simply be consumed by a TAP emitter, which prints 'ok $n', but that isn't all you can do.

Test:: modules written with Test2 no longer have to parse the TAP stream to check that their own tests behave as intended. Test2's robust system of events can be intercepted and then compared as streams of data structures. I can write tests for ::Phantom without ever having to run regexes over TAP output!

We can finally test our testers!
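
For example, assuming the Test::Builder-on-Test2 world where Test::More's functions emit Test2 events, a test module's own test suite might look roughly like this; the events come back as objects that can be inspected directly, with no TAP parsing involved:

use Test2::API qw(intercept);
use Test::More;

my $events = intercept {
    ok(1, 'this one passes');
    ok(0, 'this one fails');
};

# $events holds Test2::Event objects that we can examine as plain data.
is($events->[0]->pass, 1, 'first assertion passed');
is($events->[1]->pass, 0, 'second assertion failed');

done_testing();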

Not only that, but those event streams don't need to produce TAP at all! Users who want xUnit, TeamCity, or other formats no longer need to parse the TAP output and re-emit it. Simply attach an emitter for the preferred format and Test2 does the rest!

Why Not Opt-In?

The first response of anyone new to the situation is to ask, "Well, can't the test script author, or at least the module author, choose to opt in rather than have the guts replaced transparently?"

Remember from the top of this discussion that there is a central point of knowledge about the state of the test run? Any Test:: module can practically target only one such authority, whether it be Test::Builder or Test2. Sure, some module authors could provide one version of a module for each, but that places a burden on them and a cognitive load on the consumer, who would have to pick all their test modules from one pool or the other. In practice this will not occur; the inertia behind Test::Builder is too great. Test::Builder is the existing authority, and it must remain at least a view into the state of test progress for all existing tests.

This means we are left with the question of whether one authority can delegate to the other. Test::Builder is too limited in scope to keep improving even marginally, let alone to serve as the substructure of its own replacement; it simply doesn't have the capacity. Test2-on-Test::Builder isn't possible.

No, from those axioms we can see that for Test2 to gain any adoption, it must take over the authority from Test::Builder and provide a new Test::Builder as the interface to older tests. Which leads us back to opt-in: if users have to opt in, then tests written with and for Test2 cannot communicate with older tests, leading inevitably back to a forked ecosystem.

Now, there is no guarantee of perfection, but if you had been following this process you wouldn't doubt the lengths to which compatibility has driven it. I know I would have quit long, long ago had it been me! At this point most of the detractors have been converted, and the few holdouts seem mostly to fear the unknown effects on the DarkPAN. Fair enough; this is a judgement call: perfect backwards compatibility by atrophy, versus progress with a small risk that has been almost entirely mitigated at incredible effort.

I know where I stand on that call.

The Results of Discussions at QAH

The discussions around Test::Builder-on-Test2 centered on end-user notification. It seems that most, if not all, of the technical discussion has wrapped up and the code has been well reviewed.

To help the users of the few affected modules, it was decided to list ALL of the known broken modules (even those now fixed by upgrading from CPAN) in a document. This document (part of the Test2 distribution) will tell those few affected users what to do if they run into trouble. Further, when installing Test::Builder-on-Test2, the user will be notified if they have any installed modules which need upgrading or even replacement. Remember though, this list is very, very small.

The benefits of Test2 are huge. Let’s see it make this final step and become the new backbone of Perl’s great testing infrastructure.

dagolden: Stand up and be counted: Annual MongoDB Developer Survey

If you use Perl and MongoDB, I need your help. Every year, we put out a survey to developers to find out what languages they use, what features they need, what problems they have, and so on.

We have very few Perl responses. ☹️

Be an ally! Take the MongoDB Developer Experience Survey.


NEILB: The end of modulelist permissions

Andreas König and I have been working to remove the modulelist permissions from the PAUSE database. At the QA Hackathon we worked through the remaining cases, reviewing them with RJBS where relevant, and most of them were removed on the last day of the QAH. Following the QAH we've resolved the last handful, so there are no longer any 'm' permissions in 06perms.txt. This means that the relevant parts of PAUSE can be removed, and a number of modules can be simplified.

Hacking Thy Fearful Symmetry: Groom That Yak

Here's a super quick one.

So, for giggles I'm learning Swift. And for practice, I'm using exercism.io. A few hours ago, I wrote my first test and stuff:


$ swiftc *.swift; and ./main 
Test Suite 'All tests' started at 00:22:13.561
Test Suite 'hamming.xctest' started at 00:22:13.562
Test Suite 'HammingTest' started at 00:22:13.562
Test Case 'HammingTest.testNoDifferenceBetweenEmptyStrands' started at 00:22:13.562
Test Case 'HammingTest.testNoDifferenceBetweenEmptyStrands' passed (0.0 seconds).
Test Case 'HammingTest.testNoDifferenceBetweenIdenticalStrands' started at 00:22:13.562
Test Case 'HammingTest.testNoDifferenceBetweenIdenticalStrands' passed (0.0 seconds).
Test Case 'HammingTest.testCompleteHammingDistanceInSmallStrand' started at 00:22:13.563
Test Case 'HammingTest.testCompleteHammingDistanceInSmallStrand' passed (0.0 seconds).
Test Case 'HammingTest.testSmallHammingDistanceInMiddleSomewhere' started at 00:22:13.563
Test Case 'HammingTest.testSmallHammingDistanceInMiddleSomewhere' passed (0.0 seconds).
Test Case 'HammingTest.testLargerDistance' started at 00:22:13.563
Test Case 'HammingTest.testLargerDistance' passed (0.0 seconds).
Test Case 'HammingTest.testReturnsNilWhenOtherStrandLonger' started at 00:22:13.563
HammingTest.swift:39: error: HammingTest.testReturnsNilWhenOtherStrandLonger : XCTAssertNil failed: "1" - Different length strands return nil
Test Case 'HammingTest.testReturnsNilWhenOtherStrandLonger' failed (0.0 seconds).
Test Case 'HammingTest.testReturnsNilWhenOriginalStrandLonger' started at 00:22:13.563
HammingTest.swift:44: error: HammingTest.testReturnsNilWhenOriginalStrandLonger : XCTAssertNil failed: "1" - Different length strands return nil
Test Case 'HammingTest.testReturnsNilWhenOriginalStrandLonger' failed (0.0 seconds).
Test Suite 'HammingTest' failed at 00:22:13.563
        Executed 7 tests, with 2 failures (0 unexpected) in 0.0 (0.0) seconds
Test Suite 'hamming.xctest' failed at 00:22:13.563
        Executed 7 tests, with 2 failures (0 unexpected) in 0.0 (0.0) seconds
Test Suite 'All tests' failed at 00:22:13.563
        Executed 7 tests, with 2 failures (0 unexpected) in 0.0 (0.0) seconds

That's fine, but holy wall of text, Batman... It's not the worst thing I've ever had to peer at (I'm looking at you, each and every Java logging system ever conceived), but it's still not the most readable thing ever.

So... I'm probably re-inventing a perfectly fine wheel that exists somewhere else (and if I am, please enlighten me in the comment section), but I blurped out a quick tool called groom. The idea is to define regexes in a config file that pick out lines to be colored or altered.

For example, for those Swift test results I created the following groom.yaml rule file.


# show the testcase names in glorious green
- 
    "^Test Case":
        eval: |
            s/(?<=')([^']+)/colored [ 'green' ], $1 /e;
        fallthrough: 1
# remove the 'started' lines altogether
- started: 
    eval: $_ = ''
# victory or defeat, unicoded and colored
- passed: 
    color: blue
    only_match: 1
    eval: s/^/✔ /
- failed: 
    color: red
    only_match: 1
    eval: s/^/❌ /
# final lines underlined for emphasis
-
    "^\s+Executed": rgb202 underline

Which gives me

groomed screenshot

Purty, eh?

Sawyer X: Perl 5 Porters Mailing List Summary: April 14th-27th

Hey everyone,

Following is the p5p (Perl 5 Porters) mailing list summary for the two weeks I've missed. Sorry about that. Enjoy!

April 14th-27th

Correction: Previous summary stated that Perl #121734 was resolved. It was not. Thanks, Tony, for the correction!

Edit: Vincent Pit notes that there was no discussion about extracting parts of Scope::Upper (as suggested in the summary), but instead there was a single comment on writing a subset of it. This post was corrected. Thank you, Vincent.

News and updates

A lot has happened during these two weeks since the previous summaries went out.

Ricardo Signes stepped down from the role of the Perl Pumpking. Feel free to offer words of praise and thanks.

Sawyer X is the next pumpking.

Ricardo Signes released Perl 5.24.0-RC3. You can read more in the release announcement.

Dave Mitchell provides grant #2 reports #123 and #124, available here. Most of his time was spent on getting Scope::Upper to work on Perl 5.23.8 and above.

Tony Cook provides grant 7's 3rd and 4th reports.

More from Tony Cook, a summary of the March grant work.

Issues

New issues

Resolved issues

  • Perl #113644: Panic error in perl5db.pl.
  • Perl #125584: Mysterious taint issue in Bugzilla4Intranet.
  • Perl #127709: Documentation problem with links and perlpod, podchecker.
  • Perl #127894: -DDEBUGGING -Dusequadmath -Dusethreads builds crash early.
  • Perl #127899: Extra slash in perldelta example in 5.22.2-RC1 and 5.24.0-RC1 confusing.
  • Perl #127936: sprintf typo in 5.24 perldelta.

Proposed patches

Jim Keenan provides a patch for Perl #127391 in order to move forward with the documentation issue.

John Lightsey provided a patch in Perl #127923 to add blacklist and whitelist functionality to Locale::Maketext.

Jerry D. Hedden provided patches to upgrade threads to 2.06, Thread::Queue to 3.08, and threads::shared to 1.51.

Yves Orton provided a patch for Perl #123562, a problem with regular expressions possibly hanging at 100% CPU, which is considered a security problem.

Aaron Crane provided a patch for RT #100183, but since 5.24 is already at RC releases, it is frozen and the patch will get in on version 5.25. You can read Aaron's comment here.

Matthew Horsfall provided a patch relating to Perl #126579, warnings about newlines in open.

Matthew Horsfall also provided a patch for Perl #124050, t/harness.t can mistakenly run tests outside the perl source tree.

Aristotle Pagaltzis provided a patch to clean up Module::CoreList.

Aristotle also provided a patch to fix Perl #127981.

Discussion

Todd Rinaldo raised RT#127810 to provide a -Dfortify_inc Configure option to control the current directory appearing in @INC. The conversation around it continued further.

Zefram provides a detailed explanation of an observation made by Slaven Rezić in Perl #127909.

Sisyphus raised a confusing bit of documentation, which was fixed and backported to 5.22.

Sisyphus also asks about the binary name expected for make.exe. Bulk88 explains that it is easier to have it called gmake.exe to know what options it supports, and Leon Timmermans suggests it is possible to address it.

Dave Mitchell discussed his work on Scope::Upper. It seems Dave was able to get most of it working, but due to how the module works, Dave does not believe any sensible API can shield the module from breakage. One comment surfaced the idea of writing a subset of its functionality.

Maxwell Carey asks what could cause a problem described on Stack Overflow with a failure to print.

In an email to the list, Sisyphus asks about the current state of ExtUtils::MakeMaker with relation to the current version in blead vs. CPAN.

The conversation about changing how the signatures feature worked with relation to @_ started by Dave Mitchell continues. Ricardo Signes provided a summary of Zefram's position and his conclusions.

Karl Williamson suggests having POSIX::set_locale refuse to switch to a locale we know will cause a libc crash.

Smylers asks in Perl #122551 whether Term::ReadLine should not use Term::ReadLine::Perl as the default.

The conversation in Perl #127232 continues. You can read more here.

Renée Baecker mentions that perllol still reflects autoderef, which was removed, and should be updated.

Ed Avis suggested in Perl #127993 teaching Perl to recognize version control conflict markers, so it could warn you clearly when you've left merge conflict markers in your code.

Ricardo Signes: I went to the Perl QA Hackathon in Rugby!

I've long said that the Perl QA Hackathon is my favorite professional event of the year. It's better than any conference, where there are some good talks and some bad talks, and some nice socializing. At the Perl QA Hackathon, stuff gets done. I usually leave feeling like a champ, and that was generally the case this time, too.

I flew over with Matt Horsfall, and the trip was fine. We got to the hotel in the early afternoon, settled in, played some chess (stalemate) and then got dinner with the folks there so far. I was delighted to get a (totally adequate) steak and ale pie. Why can't I find these in Philly? No idea.

steak and ale pie!!

The next day, we got down to business quickly. We started, as usual, with about thirty seconds of introduction from each person, and then we shut up and got to work. This year, we had most of a small hotel entirely to ourselves. This gave us a dining room, a small banquet hall, a meeting room, and a bar. I spent most of my time in the banquet hall, near the window. It seemed like the easiest place to work. Although there were many good things about the hotel, the chairs were not one of them! Still, it worked out just fine.

The MetaCPAN crew were in the dining room, a few people stayed at the bar seating most of the time, and the board room got used by various meetings, most of which I attended.

The view over my shoulder most of the time, though, was this:

getting to work!

Philippe wasn't always there with that camera, though. Just most of the time.

I think my work falls into three categories: Dist::Zilla work, meeting work, and pairing work.

Dist::Zilla

Two big releases of Dist::Zilla came out of the QAH. One was v5.047 (and the two preceding it), which closed about 40 issues and pull requests. Some of those just needed application, but others needed tests, or rework, or review, or whatever. Amusingly enough, other people at the QAH were working on Dist::Zilla issues, so as I tried to close out the obvious branches, more easy-to-merge branches kept popping up!

Eventually I got down to things that I didn't think I could handle and moved on to my next big Dist::Zilla task for the day: version six!

My goal with Dist::Zilla has been to have a new major version every year or two, breaking backward compatibility if needed, to fix things that seemed worth fixing. I've been very clear that while I value backcompat quite a lot in most software, Dist::Zilla will remain something of a wild west, where I will consider nothing sacred, if it gets me a big win. The biggest change for v6 was replacing Dist::Zilla's use of Path::Class with Path::Tiny. This was not a huge win, except insofar as it lets me focus on knowing and using a single API. It's also a bit faster, although it's hard to notice that under Dist::Zilla's lumbering pace.

Karen Etheridge and I puzzled over some encoding issues, specifically around PPI. The PPI plugin had changed, about two years ago, to passing octets rather than characters to PPI, and we weren't sure why. Karen was convinced that PPI did better with characters, but I had seen it hit a fatal bug that using octets avoided. Eventually, with the help of git blame and IRC logs, we determined that the problem was... a byte order mark. Worse yet, a BOM on UTF-8!

When parsing a string, PPI does in fact expect characters, but does not expect that the first one might be U+FEFF, the character used at offset 0 in files as a byte order mark to indicate the UTF encoding type. Perl's UTF-16 encoding layers will notice and use the BOM, but the UTF-8 layer will not, because a BOM on a UTF-8 file is a bad idea. Rather than try to do anything incredibly clever, I did something quite crude: I strip off leading U+FEFF when reading UTF-8 files and, sometimes, strings. Although this isn't always correct, I feel pretty confident that anybody who has put a literal ZERO WIDTH NO-BREAK SPACE in their code is going to deserve whatever they get.
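
As a rough sketch (this is not the actual Dist::Zilla code), the crude fix amounts to something like the following when slurping a file as UTF-8:

use Path::Tiny qw(path);

my $content = path('lib/Foo.pm')->slurp_utf8;   # decode the file as UTF-8
$content =~ s/\A\x{FEFF}//;                     # drop a leading byte order mark, if any
# ... now hand $content to PPI as characters ...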

With that done, a bunch of encoding issues go away and you can once again use Dist::Zilla on code like:

my $π = 22 / 7;

This also led to some fixes for Unicode text in __DATA__ sections. As with the Path::Tiny change, a number of downstream plugins were affected in one way or another, and I did my best to mitigate the damage. In most cases, anything broken was only working accidentally before.

Dist::Zilla v6.003 is currently in trial on CPAN, and a few more releases with a few more changes will happen before v6 is stable.

Oh, and it requires perl v5.14.0 now. That's perl from about five years ago.

Meetings!

I was in a number of meetings, but I'm only going to mention one: the Test2 meeting. We wanted to discuss the way forward for Test2 and Test::Builder. I think this needs an entire post of its own, which I'll try to get to soon. In short, the majority view in the room was that we should merge Test2 into Test-Simple and carry on. I am looking forward to this upgrade.

Other meetings included:

  • renaming the QAH (I'm not excited either way)
  • using Test2 directly in core for speed (turns out it's not a big win)
  • getting more review of issues filed on Software-License

A bit more about that last one: I wrote Software-License, and I feel I've done as much work on it as I care to, at least in the large. Now it gets a steady trickle of issues, and I'm not excited to keep doing it all myself. I recruited some helpers, but mostly nothing has come of it. I tried to rally the troops a bit to encourage more regular review, even if it just leads to each person giving a +1 or -1 on each pull request. Otherwise, Software-License is likely to languish.

Pairing!

I really enjoy being "free floating helper guy" at QAH. It's something I've done a lot ever since Oslo. What I mean is this: I look around for people who look frustrated and say, "Hey, how's it going?" Then they say what's up, and we talk about the issue. Sometimes, they just need to say things out loud, and I'm their rubber duck. Other times, we have a real discussion about the problem, do a bit of pair programming, debate the benefits of different options, or whatever. Even when I'm not really involved in the part of the toolchain being worked on, I feel like I have been able to contribute a lot this way, and I know it makes me more valuable to the group in general, because it leaves me with more of an understanding of more parts of the system.

This time, I was involved in review, pairing, or discussion on:

  • fixing Pod::Simple::Search with Neil Bowers
  • testing PAUSE web stuff with Pete Sergeant
  • CPAN::Reporter client code with Breno G. de Oliveira
  • DZ plugin issues with Karen Etheridge
  • Log::Dispatchouli logging problems with Sawyer X
  • PAUSE permissions updates with Neil Bowers
  • PAUSE indexing updates with Colin Newell (I owe him more code review!)
  • improvements to PAUSE's testing tools with Matthew Horsfall
  • PPI improvements with Matthew Horsfall

...and probably other things I've already forgotten.

more hard work

Pumpking Updates

A few weeks ago, I announced that I'm retiring as pumpking after a good four and a half years on the job. On the second night of the hackathon, the day ended with a few people saying some very nice things about me and giving me both a lovely "Silver Camel" award and also a staggering collection of bottles of booze. I had to borrow some extra luggage to bring it all home. (Also, a plush camel, a very nice hardbound notebook, and a book on cocktails!) I was asked to say something, and tried my best to sound at least slightly articulate.

Meanwhile, there was a lot of discussion going on — a bit at the QAH but more via email — about who would be taking over. In the end, Sawyer X agreed to take on the job. The reaction from the group, when this was announced, was strong and positive, except possibly from Sawyer himself, as he quickly fled the room, presumably to consider his grave mistake. He did not, however, recant.

perl v5.24.0

I didn't want to spend too much time on perl v5.24.0 at the QAH, but I did spend a bit, rolling out RC2 after discussing Configure updates with Tux and (newly-minted Configure expert) Aaron Crane. I'm hoping that we'll have v5.24.0 final in about a week.

Perl QAH 2017

I'm definitely looking forward to next year's QAH, wherever it may be. This year, I had hoped to do some significant refactoring of the internals of PAUSE, but as the QAH approached, I realized that this was a task I'd need to plan ahead for. I'm hoping that between now and QAH 2017, I can develop a plan to rework the guts to make them easier to unit test and then to re-use.

Thanks, sponsors!

The QAH is a really special event, in that most of the attendees are brought to it on the sponsors' dime. It's not a conference or a fun code jam, but a summit paid for by people and corporations who know they'll benefit from it. There's a list of all the sponsors on the event page, including, but not limited to:

Raise a glass to them!

Sawyer X: My first QAH (QA Hackathon)

The QA Hackathon gathers various major contributors to the core infrastructure of the Perl language and the CPAN. People who have immense responsibility and whose contributions fuel our work and our businesses.

The QAH is a good opportunity for these contributors to discuss important topics, reach solutions or make decisions on how to move forward, and to hack on all the core infrastructure.

I was invited to this QAH hoping I would be able to contribute. While I finally got to see a lot of people I had missed and had the pleasure of meeting people I had always wished to meet, I also tried to work on several things, all described on the results page. I'd like to go into a bit of detail regarding the work I did. Let's hope it was useful enough to justify my attendance.

There were several discussions, and I tried to attend them all (managing to attend most). These included Test2 (which Chad Granum has written about), the QAH itself, and the CPAN River (which Neil Bowers has written about at length). I also led one discussion about how to package Perl modules and applications across operating systems and languages. I hope it went well.

The majority of my time I worked with Leo Lapworth (Ranguard), Olaf Alders, and Mickey Nasriachi. While we are all part of the MetaCPAN core (Leo and Olaf in earnest, Mickey and myself by working on MetaCPAN::Client), Mickey and I still had only a preliminary understanding of the internal code-base (okay, just me!), and we finally had a chance to work together on porting MetaCPAN to a significantly updated Elasticsearch. This is crucial. I am happy I got to work with them on it, and I believe we've made substantial progress. I also got to update MetaCPAN::Client, which will be released shortly, so that's good.

I also worked on packaging some more. I hope to reach a point where I can write about this separately, but it's still much too early for that. Thanks to mst for putting up with my unending questions on logging and Log::Contextual. "Why does it do this?", "Why not this?", "Log::Dispatch does this", "Log::Dispatchouli does that", "Why doesn't this work?" - Matt sure has patience!

I got to work with Aaron Crane, which was very fun for me. We looked at Ref::Util. Aaron provided a PR to add some plain, explicitly non-blessed implementations of Ref::Util's functions. I merged those, and together we looked at an interesting issue, raised by Yves Orton, about regular expressions. Regexp references require checking for magic, so their implementation in my module was fairly naive, and by "naive", I mean "wrong". We fixed that for Perl 5.10 and above (both in the XS version and in the optimized opcodes version).

We then discussed, along with Matthew (wolfsage) Horsfall, adding the SvRXOK macro to Devel::PPPort so we could use it to support the same functionality for Perl versions 5.6 and 5.8. Even if that doesn't happen, I got to work with Aaron and see his archaeological prowess in action. Also, I finally had a reason to ping Matthew. :)

On the same topic, mst, Aaron, and I discussed the issues with blessed vs. non-blessed references and how tricky this can be. We discussed the most common practice and what we can do to improve. We decided to introduce more plain functions and make the blessed ones separate. There will be additional posts on this topic, I assure you. Thank you, mst and Aaron, for exemplifying how easily one (or two) can convince with the right approach!

Ishigaki-san found that I was responsible for a very bad JSON practice in a Dancer plugin I wrote years ago, and I didn't even know it. We looked together at how to fix it cleanly, went through the tests, fixed it, and released a new version on the spot. I provided a pull request for the equivalent module in Dancer2.

And within all of this, once in a while, Abe Timmerman consulted with me about a feature request I had regarding Perl core tests. He was working on making it easier to see when there's a change in platform tests, which cannot be easily seen right now. The interface Abe is working on allows us to see that a smoker which had previously failed is now passing.

I want to thank the organizers: Barbie, JJ, and Neil. You did an incredible job. I want to also thank all of the sponsors for the QAH - it could not have been done without your help.

Sponsors for the Perl QA Hackathon 2016

Perl News: New Perl Pumpking

Ricardo Signes has stepped down from the role of Perl Pumpking after four and a half years in the role — and an unprecedented five stable 5.x.0 releases of Perl.

The role has been passed on to Sawyer, who has laid out his plans for maintaining the great work the community is already doing.

The community owes a lot to Ricardo. We wish Sawyer all the best for the work to come.

Sawyer X: A pumpking is born

As some of you might have heard, Ricardo Signes has stepped down from the role of Perl Pumpking. He passed that role on to me.

I wrote the following message to Perl 5 Porters, and it seems apt to share it with the community at large here:

("group" in this context is the Perl 5 core development group: p5p.)

It's very hard for me to summarize what I think or how I feel about this. I'm deeply honored by the trust placed in me by both Ricardo and the group. This means a lot, and I intend to work hard to justify it.

I'm still getting to grips with what this role entails. I have a lot of emails to read, quite a few to write. Releases will continue as planned. I hope to make the process even easier. I hope to introduce more people to the guts of Perl and have more hands on board. I also fully intend to continue releasing the p5p summaries, possibly even more regularly.

For what it's worth, I have no great plans when it comes to changes to the Perl core. Some people are afraid that since Ricardo is stepping down and a less experienced person is stepping up, things will hang and items might stall. Others are afraid that now everything will change, because my head has mostly been in higher-churn code and I might not see the possible risks. I want to clarify that my role will be centered on allowing and making it easier for others to work, have discussions, and reach decisions.

Given that, I also hope to receive the group's help in doing this. My technical abilities fall short of countless others', and I trust in the knowledge and experience of my peers. My social skills could use a bit of help too. :)

One thing I will continue to maintain is the civility of the list, keeping it a place where we can have technical conversations without worrying about abusive language or behavior.

That's all I have for now, and thank you again, from the bottom of my heart.

Sawyer. <3

sawyer-the-pumpking.jpg

Aristotle: A very stupid, over-clever scoping-based importing trick

In some code I’m working on, I use a module which exposes a whole bunch of package variables as part of its public interface. It does not export them, however. I could refer to them directly by prefixing them all with Some::Module::, but that would be ugly and verbose. It’s also unsafe – the vars stricture will not help you catch typos if you use fully qualified names.

The obvious solution would be to emulate what exporting does:

our ($bar, @baz, %quux);
*bar  = \$Some::Module::bar;
*baz  = \@Some::Module::baz;
*quux = \%Some::Module::quux;

But this forces me to mention every single variable name thrice, just so I can declare it.

Another option would be to just use references stored in lexicals:

my $bar  = \$Some::Module::bar;
my $baz  = \@Some::Module::baz;
my $quux = \%Some::Module::quux;

That does remove one point of repetition. But it does so in exchange for having to strew arrows or double sigils all over the code.

Either way, the package name must be repeated over and over and over. This is possible to fix with a fairly innocuously stupid trick:

our ($bar, @baz, %quux);
(*bar, *baz, *quux) = do {
    package Some::Module;
    \($bar, @baz, %quux);
};

But this is, if anything, uglier than before. It does, however, allow for a nice improvement under recent perls (namely, given the refaliasing feature):

\our ($bar, @baz, %quux) = do {
    package Some::Module;
    \($bar, @baz, %quux);
};

That’s not so bad any more.

But it’s still repetitive. It’s not nice enough that I would use it when I’m taking a shortcut, and it’s not obvious enough that I would use it when I’m aiming for maintainability.

It is possible to go further, though.

Do you know what our does? Do you know exactly? Let me tell you: it declares a lexically scoped variable that aliases the package variable of the same name. Packages and lexical scopes are orthogonal concepts. The current package can change any number of times within a scope. And a lexical variable declared during a scope will be visible during the entire rest of the scope, regardless of the current package at any given point. Which means…

package Some::Module;
our ($bar, @baz, %quux);
package main;

… this works. After changing the current package to the one we want, lexically scoped aliases declared with our refer to package variables in that package. When we then switch the current package back, we remain in the same scope, so the lexicals we just declared remain visible.
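
Here is a self-contained sketch of the trick in action (Some::Module and $bar are stand-ins, not a real module), showing that the alias really is the package variable, with strict still in effect:

use strict;
use warnings;

# Pretend this package variable belongs to a module we loaded.
$Some::Module::bar = 42;

package Some::Module;
our $bar;          # lexically scoped alias to $Some::Module::bar
package main;

print $bar, "\n";                # 42 - reads Some::Module's variable
$bar = 'changed';
print $Some::Module::bar, "\n";  # 'changed' - it really is the same variable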

The only irritating part is that the switch back has to explicitly name the package, something that was not necessary with the do { package ...; ... } construct above. Unfortunately we cannot block-scope the package switch like that, because that would also block-scope the lexicals, and we need them to survive. That was, after all, the whole point of the exercise.

So it can’t be made perfect, but it’s very close: every identifier involved (the package name and each variable of interest) is named exactly once, with exactly one redundant mention of an identifier not relevant to the intention of the code.

Clearly this… technique is too obscure to use in code that needs maintenance. But in throw-away scripts it might just come in handy.

Ranguard's blog: Perl QAH and MetaCPAN

This year was my first Perl Quality Assurance Hackathon, and even then I could only make 2 of the 4 days. I now wish I'd been to every one ever, and for the full time!

I've been working on the MetaCPAN project for over 4 years, taking on the puppet master/sysadmin/development VM builder type role, even though that's not really my day job. So after all this time, actually meeting Olaf Alders and the recently joined Mickey was a great pleasure.

We have been working towards a massive upgrade of Elasticsearch (from 0.2 to 2.3!), which will give us clustering and all-round general improvements. This has been a long, long project, and being able to sit together and plow through issues was fantastic. We kept an overview of what we wanted to get done, and there are a lot of ticks on it.

There is still more, but it's now close!

I mentioned to everyone that I could help with speeding up websites, and the cpantesters team wanted to know more. As of this morning they are using Fastly, so the site should be quicker to access around the world. This should also give some other long-term offloading benefits that they are now working on.

Paul Johnson and I also discussed various aspects of http://cpancover.com/, the first of which will be for the MetaCPAN infrastructure to act as a backup of his data, so that if his box goes down it doesn't take a week to recover. There is also a plan to link to the results from the MetaCPAN site.

I also met an amazing designer and she is now helping on several projects... more on that another time.

Thanks to Wendy for amazing strawberries and support, to Neil and the others for such fantastic organisation, and to everyone else who attended and made it such a successful event.

I'd also like to thank Foxtons, my employer, for sending me (as of writing, we are hiring!), and all the other sponsors for making such an amazing and productive event possible:

FastMail,
ActiveState,
ZipRecruiter,
Strato,
SureVoIP,
CV-Library,
thinkproject!,
MongoDB,
Infinity,
Dreamhost,
Perl 6,
Perl Careers,
Evozon,
Booking,
Eligo,
Oetiker+Partner,
CAPSiDE,
Perl Services,
Procura,
Constructor.io,
Robbie Bow,
Ron Savage,
Charlie Gonzalez,
Justin Cook,

Ricardo Signes: Dist::Zilla v6 is here (in trial format)

I've been meaning to release Dist::Zilla v6 for quite a while now, but I've finally done it as a trial release. I'll make a stable release in a week or two. So far, I see no breakage, which is about what I expected. Here's the scoop:

Path::Class has been dropped for Path::Tiny.

Actually, you get a subclass of Path::Tiny. This isn't really supported behavior. In fact, Path::Tiny tells you not to do this. It won't be here long, though, and it only needs to work one level deep, which it does. It's just enough to give people downstream a warning instead of an exception. A lot of the grotty work of updating the internals to use Path::Tiny methods instead of Path::Class methods was done by Kent Frederic. Thanks, Kent!

-v no longer takes an argument

It used to be that dzil test -v put things in verbose mode, dzil test -v Plugin put just that plugin in verbose mode, and dzil -v test screwed up because it decided you meant test as a plugin name, and then couldn't figure out what command to run.

Now -v is all-things-verbose and -V is one plugin. It turns out that single-plugin verbosity has been broken for some time, and still is. I'll fix it very soon.

Deprecated plugins deleted

I've removed [Prereq] and [AutoPrereq] and [BumpVersion]. These were long marked as deprecated. The first two are just old spellings of the now-canonically-plural versions. BumpVersion is awful and nobody should use it ever.

PkgVersion can generate "package NAME VERSION" lines

So, now you can avoid deciding how to assign to $VERSION, and instead add the version number directly to the package declaration. This also avoids the need to leave room for a blank line in which to add $VERSION.
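
For illustration (the module name and version here are made up), the difference is between inserting an assignment into the module body and carrying the version on the package statement itself, a form Perl has supported since 5.12:

# What has traditionally been inserted into the module body:
our $VERSION = '0.001';

# What can now be generated instead, directly on the package declaration:
package Foo::Bar 0.001;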

Dist::Zilla now requires v5.14.0

Party like it's 2011.

Dave's Free Press: Journal: Module pre-requisites analyser

Dave's Free Press: Journal: CPANdeps

Dave's Free Press: Journal: Perl isn't dieing

Perlgeek.de : Introducing Go Continuous Delivery

Go Continuous Delivery (GoCD, or simply Go for short) is an open source tool that controls an automated build or deployment process.

It consists of a server component that holds the pipeline configuration, polls source code repositories for changes, schedules and distributes work, collects artifacts, presents a web interface to visualize and control it all, and offers a mechanism for manual approval of steps. One or more agents connect to the server and carry out the actual jobs in the build pipeline.

Pipeline Organization

Every build, deployment, or test job that GoCD executes must be part of a pipeline. A pipeline consists of one or more linearly arranged stages. Within a stage, jobs potentially run in parallel and are individually distributed to agents. Within a job, tasks are again executed linearly. The most general task is the execution of an external program; other tasks include the retrieval of artifacts, or specialized things such as running a Maven build.
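
To make that nesting concrete, here is a rough, hand-written sketch of how a minimal pipeline might look in the XML configuration. The pipeline name, repository URL, and command are invented; a real pipeline configuration for this series' example project comes in a later post.

<pipelines group="deployment">
  <pipeline name="example-app">
    <materials>
      <git url="https://example.com/git/example-app.git" />
    </materials>
    <stage name="build">
      <jobs>
        <job name="build-deb">
          <resources>
            <resource>debian-jessie</resource>
          </resources>
          <tasks>
            <exec command="/bin/bash">
              <arg>-c</arg>
              <arg>dpkg-buildpackage -b -us -uc</arg>
            </exec>
          </tasks>
        </job>
      </jobs>
    </stage>
  </pipeline>
</pipelines>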

Matching of Jobs to Agents

When an agent is idle, it polls the server for work. If the server has jobs to run, it uses two criteria to decide if the agent is fit for carrying out the job: environments and resources.

Each job is part of a pipeline, and a pipeline is part of an environment. On the other hand, each agent is configured to be part of one or more environments. An agent only accepts jobs from pipelines from one of its environments.

Resources are user-defined labels that describe what an agent has to offer, and inside a pipeline configuration you can specify what resources a job needs. For example, you can declare that a job requires the phantomjs resource to test a web application; then only agents to which you assign this resource will execute that job. It is also a good idea to add the operating system and version as resources. In the example above, the agent might have the phantomjs, debian and debian-jessie resources, offering the author of the job some choice of granularity for specifying the required operating system.

Installing the Go Server on Debian

To install the Go server on a Debian or Debian-based operating system, first you have to make sure you can download Debian packages via HTTPS:

$ apt-get install -y apt-transport-https

Then you need to configure the package sources:

$ echo 'deb http://dl.bintray.com/gocd/gocd-deb/ /' > /etc/apt/sources.list.d/gocd.list
$ curl https://bintray.com/user/downloadSubjectPublicKey?username=gocd | apt-key add -

And finally install it:

$ apt-get update && apt-get install -y go-server

When you now point your browser at port 8154 of the go server for HTTPS (ignore the SSL security warnings) or port 8153 for HTTP, you should see the go server's web interface:

To prevent unauthenticated access, create a password file (you need to have the apache2-utils package installed to have the htpasswd command available) on the command line:

$ htpasswd -c -s /etc/go-server-passwd go-admin
New password:
Re-type new password:
Adding password for user go-admin
$ chown go: /etc/go-server-passwd
$ chmod 600 /etc/go-server-passwd

In the go web interface, click on the Admin menu and then "Server Configuration". In the "User Management" section, enter the path /etc/go-server-passwd in the field "Password File Path" and click on "Save" at the bottom of the form.

Immediately afterwards, the go server asks you for username and password.

You can also use LDAP or Active Directory for authentication.

Installing a Go Worker on Debian

On one or more servers where you want to execute the automated build and deployment steps, you need to install a go agent, which will connect to the server and poll it for work. On each server, you need to do the same first three steps as when installing the server, to ensure that you can install packages from the go package repository. Then, of course, install the go agent:

$ apt-get install -y apt-transport-https
$ echo 'deb http://dl.bintray.com/gocd/gocd-deb/ /' > /etc/apt/sources.list.d/gocd.list
$ curl https://bintray.com/user/downloadSubjectPublicKey?username=gocd | apt-key add -
$ apt-get update && apt-get install -y go-agent

Then edit the file /etc/default/go-agent. The first line should read

GO_SERVER=127.0.0.1

Change the right-hand side to the hostname or IP address of your go server, and then start the agent:

$ service go-agent start

After a few seconds, the agent has contacted the server, and when you click on the "Agents" menu in the server's web frontend, you should see the agent:

("lara" is the host name of the agent here).

A Word on Environments

Go makes it possible to run agents in specific environments: for example, you could run a go agent on each testing machine and on each production machine, and use the matching of pipelines to agent environments to ensure that an installation step happens on the right machine in the right environment. If you go with this model, you can also use Go to copy the build artifacts to the machines where they are needed.

I chose not to do this, because I didn't want to have to install a go agent on each machine that I want to deploy to. Instead I use Ansible, executed on a Go worker, to control all machines in an environment. This requires managing the SSH keys that Ansible uses, and distributing packages through a Debian repository. But since Debian seems to require a repository anyway to be able to resolve dependencies, this is not much of an extra hurdle.

So don't be surprised when the example project here only uses a single environment in Go, which I call Control.

First Contact with Go's XML Configuration

There are two ways to configure your Go server: through the web interface, and through a configuration file in XML. You can also edit the XML config through the web interface.

While the web interface is a good way to explore go's capabilities, it quickly becomes annoying to use due to too much clicking. Using an editor with good XML support gets things done much faster, and it lends itself better to compact explanation, so that's the route I'm taking here.

In the Admin menu, the "Config XML" item lets you see and edit the server config. This is what a pristine XML config looks like, with one agent already registered:

<?xml version="1.0" encoding="utf-8"?>
<cruise xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="cruise-config.xsd" schemaVersion="77">
<server artifactsdir="artifacts" commandRepositoryLocation="default" serverId="b2ce4653-b333-4b74-8ee6-8670be479df9">
    <security>
    <passwordFile path="/etc/go-server-passwd" />
    </security>
</server>
<agents>
    <agent hostname="lara" ipaddress="192.168.2.43" uuid="19e70088-927f-49cc-980f-2b1002048e09" />
</agents>
</cruise>

The serverId and the agent data will differ in your installation, even if you followed the same steps.

To create an environment and put the agent in it, add the following section somewhere within <cruise>...</cruise>:

<environments>
    <environment name="Control">
    <agents>
        <physical uuid="19e70088-927f-49cc-980f-2b1002048e09" />
    </agents>
    </environment>
</environments>

(The agent UUID must be that of your agent, not of mine).

To give the agent some resources, you can change the <agent .../> tag in the <agents> section to read:

<agent hostname="lara" ipaddress="192.168.2.43" uuid="19e70088-927f-49cc-980f-2b1002048e09">
  <resources>
    <resource>debian-jessie</resource>
    <resource>build</resource>
    <resource>debian-repository</resource>
  </resources>
</agent>

Creating an SSH key

It is convenient for Go to have an SSH key without a password, so that it can, for example, clone git repositories via SSH.

To create one, run the following commands on the server:

$ su - go
$ ssh-keygen -t rsa -b 2048 -N '' -f ~/.ssh/id_rsa

And either copy the resulting .ssh directory and the files therein onto each agent into the /var/go directory (and remember to set owner and permissions as they were created originally), or create a new key pair on each agent.

Ready to Go

Now that the server and an agent have some basic configuration, they are ready for their first pipeline configuration. Which we'll get to soon :-).


I'm writing a book on automating deployments. If this topic interests you, please sign up for the Automating Deployments newsletter. It will keep you informed about automating and continuous deployments. It also helps me to gauge interest in this project, and your feedback can shape the course it takes.


Dave's Free Press: Journal: YAPC::Europe 2007 report: day 3

Perlgeek.de : Managing State in a Continuous Delivery Pipeline

Continuous Delivery is all nice and fluffy for a stateless application. Installing a new version is a simple task: install the new binaries (or sources, in the case of a language that's not compiled), stop the old instance, and start a new instance. Bonus points for reversing the order of the last two steps, to avoid downtime.

But as soon as there is persistent state to consider, things become more complicated.

Here I will consider traditional, relational databases with schemas. You can avoid some of the problems by using a schemaless "noSQL" database, but you don't always have that luxury, and it doesn't solve all of the problems anyway.

Along with the schema changes you have to consider data migrations, but they aren't generally harder to manage than schema changes, so I'm not going to consider them in detail.

Synchronization Between Code and Database Versions

State management is hard because code is usually tied to a version of the database schema. There are several cases where this can cause problems:

  • Database changes are often slower than application updates. If version 1 of your application can only deal with version 1 of the schema, and version 2 of the application can only deal with version 2 of the schema, you have to stop the application at version 1, do the database upgrade, and start up version 2 of the application only after the database migration has finished.
  • Stepbacks become painful. Typically either a database change or its rollback can lose data, so you cannot easily do an automated release and stepback over these boundaries.

To elaborate on the last point, consider the case where a column is added to a table in the database. Here the rollback of the change (deleting the column again) loses data. On the other hand, if the original change is to delete a column, that step usually cannot be reversed: you can recreate a column of the same type, but the data is lost. Even if you archive the deleted column's data, new rows might have been added to the table in the meantime, and there is no data to restore for those new rows.

Do It In Multiple Steps

There is no tooling that can solve these problems for you. The only practical approach is to collaborate with the application developers, and break up the changes into multiple steps (where necessary).

Suppose your desired change is to drop a column that has a NOT NULL constraint. Simply dropping the column in one step comes with the problems outlined above. In a simple scenario, you might be able to do the following steps instead:

  • Deploy a database change that makes the column nullable (or give it a default value)
  • Wait until you're sure you don't want to roll back to a version where this column is NOT NULL
  • Deploy a new version of the application that doesn't use the column anymore
  • Wait until you're sure you don't want to roll back to a version of your application that uses this column
  • Deploy a database change that drops the column entirely.

In a more complicated scenario, you might first need to deploy a version of your application that can deal with reading NULL values from this column, even if no code writes NULL values yet.
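
To make the database side of this concrete, here is a minimal sketch of the first and last schema steps, driven from Perl with DBI and DBD::Pg; the table and column names are hypothetical and the DDL uses PostgreSQL syntax:

use DBI;

# Connect to the application's database (connection details are placeholders).
my $dbh = DBI->connect('dbi:Pg:dbname=myapp', 'deploy', 'secret',
                       { RaiseError => 1, AutoCommit => 1 });

# Step 1: relax the constraint so old and new application versions can coexist.
$dbh->do('ALTER TABLE orders ALTER COLUMN legacy_flag DROP NOT NULL');

# ... several application deployments later, once a rollback past this
# point is no longer needed ...

# Final step: drop the column entirely.
$dbh->do('ALTER TABLE orders DROP COLUMN legacy_flag');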

Adding a column to a table works in a similar way:

  • Deploy a database change that adds the new column with a default value (or NULLable)
  • Deploy a version of the application that writes to the new column
  • optionally run some migrations that fill the column for old rows
  • optionally deploy a database change that adds constraints (like NOT NULL) that weren't possible at the start

... with the appropriate waits between the steps.

Prerequisites

If you deploy a single logical database change in several steps, you need to do maybe three or four separate deployments, instead of one big deployment that introduces both code and schema changes at once. That's only practical if the deployments are (at least mostly) automated, and if the organization offers enough continuity that you can actually finish the change process.

If the developers are constantly putting out fires, chances are they never get around to adding that final, desired NOT NULL constraint, and some undiscovered bug will lead to missing information later down the road.

Tooling

Unfortunately, I know of no tooling that supports the inter-related database and application release cycle that I outlined above.

But there are tools that manage schema changes in general. For example sqitch is a rather general framework for managing database changes and rollbacks.

On the lower level, there are tools like pgdiff that compare the old and new schema, and use that to generate DDL statements that bring you from one version to the next. Such automatically generated DDLs can form the basis of the upgrade scripts that sqitch then manages.
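
To give a rough idea of the workflow (the project and change names are made up; consult the sqitch documentation for the details), managing such a change with sqitch might look like this:

$ sqitch init myapp --engine pg
$ sqitch add drop_legacy_code -n 'Drop the legacy_code column'
# edit deploy/drop_legacy_code.sql, revert/drop_legacy_code.sql and verify/drop_legacy_code.sql
$ sqitch deploy db:pg://deploy@db01/myapp
$ sqitch revert db:pg://deploy@db01/myapp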

Some ORMs also come with frameworks that promise to manage schema migrations for you. Carefully evaluate whether they allow rollbacks without losing data.

No Silver Bullet

There is no single solution that manages all your data migrations automatically for you during your deployments. You have to carefully engineer the application and database changes to decouple them a bit. This is typically more work on the application development side, but it buys you the ability to deploy and roll back without being blocked by database changes.

Tooling is available for some pieces, but typically not for the big picture. Somebody has to keep track of the application and schema versions, or automate that.


I'm writing a book on automating deployments. If this topic interests you, please sign up for the Automating Deployments newsletter. It will keep you informed about automating and continuous deployments. It also helps me to gauge interest in this project, and your feedback can shape the course it takes.


Perlgeek.de : Automating Deployments: Distributing Debian Packages with Aptly

Once a Debian package is built, it must be distributed to the servers it is to be installed on.

Debian, as well as all other operating systems I know of, uses a pull model for that. That is, the package and its meta data are stored on a server that the client can contact to request the meta data and the package.

The sum of meta data and packages is called a repository. In order to distribute packages to the servers that need them, we must set up and maintain such a repository.

Signatures

In Debian land, packages are also signed cryptographically, to ensure packages aren't tampered with on the server or during transmission.

So the first step is to create a key pair that is used to sign this particular repository. (If you already have a PGP key for signing packages, you can skip this step).

The following assumes that you are working with a pristine system user that does not have a gnupg keyring yet, and which will be used to maintain the debian repository. It also assumes you have the gnupg package installed.

$ gpg --gen-key

This asks a bunch of questions, like your name and email address, key type and bit width, and finally a pass phrase. I left the pass phrase empty to make it easier to automate updating the repository, but that's not a requirement.

$ gpg --gen-key
gpg (GnuPG) 1.4.18; Copyright (C) 2014 Free Software Foundation, Inc.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

gpg: directory `/home/aptly/.gnupg' created
gpg: new configuration file `/home/aptly/.gnupg/gpg.conf' created
gpg: WARNING: options in `/home/aptly/.gnupg/gpg.conf' are not yet active during this run
gpg: keyring `/home/aptly/.gnupg/secring.gpg' created
gpg: keyring `/home/aptly/.gnupg/pubring.gpg' created
Please select what kind of key you want:
   (1) RSA and RSA (default)
   (2) DSA and Elgamal
   (3) DSA (sign only)
   (4) RSA (sign only)
Your selection? 1
RSA keys may be between 1024 and 4096 bits long.
What keysize do you want? (2048) 
Requested keysize is 2048 bits
Please specify how long the key should be valid.
         0 = key does not expire
      <n>  = key expires in n days
      <n>w = key expires in n weeks
      <n>m = key expires in n months
      <n>y = key expires in n years
Key is valid for? (0) 
Key does not expire at all
Is this correct? (y/N) y
You need a user ID to identify your key; the software constructs the user ID
from the Real Name, Comment and Email Address in this form:
    "Heinrich Heine (Der Dichter) <heinrichh@duesseldorf.de>"

Real name: Aptly Signing Key
Email address: automatingdeployments@gmail.com
You selected this USER-ID:
    "Moritz Lenz <automatingdeployments@gmail.com>"

Change (N)ame, (C)omment, (E)mail or (O)kay/(Q)uit? O
You need a Passphrase to protect your secret key.

You don't want a passphrase - this is probably a *bad* idea!
I will do it anyway.  You can change your passphrase at any time,
using this program with the option "--edit-key".

We need to generate a lot of random bytes. It is a good idea to perform
some other action (type on the keyboard, move the mouse, utilize the
disks) during the prime generation; this gives the random number
generator a better chance to gain enough entropy.
..........+++++
.......+++++

Not enough random bytes available.  Please do some other work to give
the OS a chance to collect more entropy! (Need 99 more bytes)
..+++++
gpg: /home/aptly/.gnupg/trustdb.gpg: trustdb created
gpg: key 071B4856 marked as ultimately trusted
public and secret key created and signed.

gpg: checking the trustdb
gpg: 3 marginal(s) needed, 1 complete(s) needed, PGP trust model
gpg: depth: 0  valid:   1  signed:   0  trust: 0-, 0q, 0n, 0m, 0f, 1u
pub   2048R/071B4856 2016-01-10
      Key fingerprint = E80A D275 BAE1 DEDE C191  196D 078E 8ED8 071B 4856
uid                  Moritz Lenz <automatingdeployments@gmail.com>
sub   2048R/FFF787F6 2016-01-10

Near the bottom the line starting with pub contains the key ID:

pub   2048R/071B4856 2016-01-10

We'll need the public key later, so it's best to export it:

$ gpg --export --armor 071B4856 > pubkey.asc

Preparing the Repository

There are several options for managing Debian repositories. My experience with debarchiver is mixed: Once set up, it works, but it does not give immediate feedback on upload; rather it communicates the success or failure by email, which isn't very well-suited for automation.

Instead I use aptly, which works fine from the command line, and additionally supports several versions of the package in one repository.

To initialize a repo, we first have to come up with a name. Here I call it internal.

$ aptly repo create -distribution=jessie -architectures=amd64,i386,all -component=main internal

Local repo [internal] successfully added.
You can run 'aptly repo add internal ...' to add packages to repository.

$ aptly publish repo -architectures=amd64,i386,all internal
Warning: publishing from empty source, architectures list should be complete, it can't be changed after publishing (use -architectures flag)
Loading packages...
Generating metadata files and linking package files...
Finalizing metadata files...
Signing file 'Release' with gpg, please enter your passphrase when prompted:
Clearsigning file 'Release' with gpg, please enter your passphrase when prompted:

Local repo internal has been successfully published.
Please setup your webserver to serve directory '/home/aptly/.aptly/public' with autoindexing.
Now you can add following line to apt sources:
  deb http://your-server/ jessie main
Don't forget to add your GPG key to apt with apt-key.

You can also use `aptly serve` to publish your repositories over HTTP quickly.

As the message says, there needs to be an HTTP server that makes these files available. For example, an Apache virtual host config for serving these files could look like this:

<VirtualHost *:80>
        ServerName apt.example.com
        ServerAdmin moritz@example.com

        DocumentRoot /home/aptly/.aptly/public/
        <Directory /home/aptly/.aptly/public/>
                Options +Indexes +FollowSymLinks

                Require all granted
        </Directory>

        # Possible values include: debug, info, notice, warn, error, crit,
        # alert, emerg.
        LogLevel notice
        CustomLog /var/log/apache2/apt/access.log combined
        ErrorLog /var/log/apache2/apt/error.log
        ServerSignature On
</VirtualHost>

After creating the logging directory (mkdir -p /var/log/apache2/apt/), enabling the virtual host (a2ensite apt.conf) and restarting Apache, the Debian repository is ready.
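
Spelled out as commands, and assuming the virtual host definition was saved as /etc/apache2/sites-available/apt.conf, that last step might look like this:

$ mkdir -p /var/log/apache2/apt/
$ a2ensite apt.conf
$ service apache2 restart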

Adding Packages to the Repository

Now that the repository is set up, you can add a package by running

$ aptly repo add internal package-info_0.1-1_all.deb
$ aptly publish update internal
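
If this is to happen unattended, for example from a build agent, a small wrapper script around those same two commands can take the package files as arguments. A minimal sketch, using the repository name internal from above:

#!/bin/bash
set -e
for deb in "$@"; do
    aptly repo add internal "$deb"
done
aptly publish update internal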

Configuring a Host to use the Repository

Copy the PGP public key with which the repository is signed (pubkey.asc) to the host which shall use the repository, and import it:

$ apt-key add pubkey.asc

Then add the actual package source:

$ echo "deb http://apt.example.com/ jessie main" > /etc/apt/source.list.d/internal

After an apt-get update, the contents of the repository are available, and an apt-cache policy package-info shows the repository as a possible source for this package:

$ apt-cache policy package-info
package-info:
  Installed: (none)
  Candidate: 0.1-1
  Version table:
 *** 0.1-1 0
        990 http://apt.example.com/ jessie/main amd64 Packages
        100 /var/lib/dpkg/status

This concludes the whirlwind tour through debian repository management and thus package distribution. Next up will be the actual package installation.


I'm writing a book on automating deployments. If this topic interests you, please sign up for the Automating Deployments newsletter. It will keep you informed about automating and continuous deployments. It also helps me to gauge interest in this project, and your feedback can shape the course it takes.


Dave's Free Press: Journal: Devel::CheckLib can now check libraries' contents

Ocean of Awareness: Top-down parsing is guessing

Top-down parsing is guessing. Literally. Bottom-up parsing is looking.

The way you'll often hear that phrased is that top-down parsing is looking, starting at the top, and bottom-up parsing is looking, starting at the bottom. But that is misleading, because the input is at the bottom -- at the top there is nothing to look at. A usable top-down parser must have a bottom-up component, even if that component is just lookahead.

A more generous, but still accurate, way to describe the top-down component of parsers is "prediction". And prediction is, indeed, a very useful component of a parser, when used in combination with other techniques.

Of course, if a parser does nothing but predict, it can predict only one input. Top-down parsing must always be combined with a bottom-up component. This bottom-up component may be as modest as lookahead, but it must be there or else top-down parsing is really not parsing at all.

So why is top-down parsing used so much?

Top-down parsing may be unusable in its pure form, but from one point of view that is irrelevant. Top-down parsing's biggest advantage is that it is highly flexible -- there's no reason to stick to its "pure" form.

A top-down parser can be written as a series of subroutine calls -- a technique called recursive descent. Recursive descent allows you to hook in custom-written bottom-up logic at every top-down choice point, and it is a technique which is completely understandable to programmers with little or no training in parsing theory. When dealing with recursive descent parsers, it is more useful to be a seasoned, far-thinking programmer than it is to be a mathematician. This makes recursive descent very appealing to seasoned, far-thinking programmers, and they are the audience that counts.

Switching techniques

You can even use the flexibility of top-down to switch away from top-down parsing. For example, you could claim that a top-down parser could do anything my own parser (Marpa) could do, because a recursive descent parser can call a Marpa parser.

A less dramatic switchoff, and one that still leaves the parser with a good claim to be basically top-down, is very common. Arithmetic expressions are essential for a computer language. But they are also among the many things top-down parsing cannot handle, even with ordinary lookahead. Even so, most computer languages these days are parsed top-down -- by recursive descent. These recursive descent parsers deal with expressions by temporarily handing control over to a bottom-up operator precedence parser. Neither of these parsers is extremely smart about the hand-over and hand-back -- it is up to the programmer to make sure the two play together nicely. But used with caution, this approach works.

Top-down parsing and language-oriented programming

But what about taking top-down methods into the future of language-oriented programming, extensible languages, and grammars which write grammars? Here we are forced to confront the reality -- that the effectiveness of top-down parsing comes entirely from the foreign elements that are added to it. Starting from a basis of top-down parsing is literally starting with nothing. As I have shown in more detail elsewhere, top-down techniques simply do not have enough horsepower to deal with grammar-driven programming.

Perl 6 grammars are top-down -- PEG with lots of extensions. These extensions include backtracking, backtracking control, a new style of tie-breaking and lots of opportunity for the programmer to intervene and customize everything. But behind it all is a top-down parse engine.

One aspect of Perl 6 grammars might be seen as breaking out of the top-down trap. That trick of switching over to a bottom-up operator precedence parser for expressions, which I mentioned above, is built into Perl 6 and semi-automated. (I say semi-automated because making sure the two parsers "play nice" with each other is not automated -- that's still up to the programmer.)

As far as I know, this semi-automation of expression handling is new with Perl 6 grammars, and it may prove handy for duplicating what is done in recursive descent parsers. But it adds no new technique to those already in use. And features like

  • multiple types of expression, which can be told apart based on their context,
  • n-ary expressions for arbitrary n, and
  • the autogeneration of multiple rules, each allowing a different precedence scheme, for expressions of arbitrary arity and associativity,

all of which are available and in current use in Marpa, are impossible for the technology behind Perl 6 grammars.

I am a fan of the Perl 6 effort. Obviously, I have doubts about one specific set of hopes for Perl 6 grammars. But these hopes have not been central to the Perl 6 effort, and I will be an eager student of the Perl 6 team's work over the coming months.

Comments

To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Perlgeek.de : Automating Deployments: Simplistic Deployment with Git and Bash

One motto of the Extreme Programming movement is to do the simplest thing that can possibly work, and only get more fancy when it is necessary.

In this spirit, the simplest deployment option for some projects is to change the working directory in a clone of the project's git repository, and run

git pull

If this works, it has a certain beauty of mirroring pretty much exactly what developers do in their development environment.

Reality kicks in

But it only works if all of these conditions are met:

  • There is already a checkout of the git repository, and it's configured correctly.
  • There are no local changes in the git repository.
  • There were no forced updates in the remote repository.
  • No additional build or test step is required.
  • The target machine has git installed, and both network connection to and credentials for the git repository server.
  • The presence of the .git directory poses no problem.
  • No server process needs to be restarted.
  • No additional dependencies need to be installed.

As an illustration of how to attack some of these problems, let's consider just the second point: local modifications in the git repository. It happens, for example, when people try things out, or do emergency fixes. git pull does a fetch (which is fine) and a merge. Merging is an operation that can fail (for example if uncommitted local changes or local commits exist) and that then requires manual intervention.

Manual changes are a rather bad thing to have in an environment where you want to deploy automatically. Their presence leaves you with two options: discard them, or refuse to deploy. If you choose the latter approach, git pull --ff-only is a big improvement; this will only do the merge if it is a trivial fast-forward merge, that is, a merge where the local side didn't change at all. If that's not the case (that is, a local commit exists), the command exits with a non-zero return value, which the caller should interpret as a failure, and report the error somehow. If it's called as part of a cron job, the standard approach is to send an email containing the error message.
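
A minimal, cron-friendly sketch of that approach (the checkout location /srv/myapp is just a placeholder):

#!/bin/bash
set -e
cd /srv/myapp
if ! git pull --ff-only; then
    echo "deployment aborted: local commits or diverged history in /srv/myapp" >&2
    exit 1
fi

Run from cron, the message on standard error ends up in the mail that cron sends when a job produces output.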

If you choose to discard the changes instead, you could do a git stash to get rid of uncommitted changes (while preserving them for a while in the depths of the .git directory for later inspection), and do a reset or checkout instead of the merge, so that the command sequence would read:

set -e
git fetch origin
git checkout --force origin/master

(This puts the local repository in a detached head state, which tends to make manual work with it unpleasant; but at this point we have reserved this copy of the git repository for deployment only; manual work should be done elsewhere.)

More Headaches

For very simple projects, using the git pull approach is fine. For more complex software, you have to tackle each of these problems, for example:

  • Clone the git repo first if no local copy exists
  • Discard local changes as discussed above (or remove the old copy, and always clone anew)
  • Have a separate checkout location (possibly on a different server), build and test there.
  • Copy the result over to the destination machine (but exclude the .git dir).
  • Provide a way to declare dependencies, and install them before doing the final copy step.
  • Provide a way to restart services after the copying

So you could build all these solutions -- or realize that they exist. Having a dedicated build server is an established pattern, and there are lots of software solutions for dealing with that. As is building a distributable software package (like .deb or .rpm packages), for which distribution systems exist -- the operating system vendors use them all the time.

Once you build Debian packages, the package manager ensures that dependencies are installed for you, and the postinst scripts provide a convenient location for restarting services.

If you choose that road, you get lots of established tooling that wasn't explicitly mentioned above, but which often makes life much easier: querying the database of existing packages, listing installed versions, finding which package a file comes from, extra security through package signing and signature verification, the ability to create meta packages, linters that warn about common packaging mistakes, and so on.

I'm a big fan of reusing existing solutions where it makes sense, and I feel this is a space where reusing can save huge amounts of time. Many of these tools have hundreds of corner cases already ironed out, and if you tried to tackle them yourself, you'd be stuck in a nearly endless exercise of yak shaving.

Thus I want to talk about the key steps in more detail: Building Debian packages, distributing them and installing them. And some notes on how to put them all together with existing tooling.


I'm writing a book on automating deployments. If this topic interests you, please sign up for the Automating Deployments newsletter. It will keep you informed about automating and continuous deployments. It also helps me to gauge interest in this project, and your feedback can shape the course it takes.


Dave's Free Press: Journal: I Love Github

Perlgeek.de : Automating Deployments: Debian Packaging for an Example Project

After general notes on Debian packaging, I want to introduce an example project, and how it's packaged.

The Project

package-info is a minimalistic web project, written solely for demonstrating packaging and deployment. When called in the browser, it produces a text document containing the output of dpkg -l, which gives an overview of installed (and potentially previously installed) packages, their version, installation state and a one-line description.

It is written in Perl using the Mojolicious web framework.

The actual code resides in the file usr/lib/package-info/package-info and is delightfully short:

#!/usr/bin/perl
use Mojolicious::Lite;

plugin 'Config';

get '/' => sub {
    my $c = shift;

    $c->render(text => scalar qx/dpkg -l/, format => 'text');
};

app->start;

It loads the "Lite" version of the framework, registers a route for the URL /, which renders as plain text the output of the system command dpkg -l, and finally starts the application.

It also loads the Config-Plugin, which is used to specify the PID file for the server process.

The corresponding config file in etc/package-info.conf looks like this:

#!/usr/bin/perl
{
    hypnotoad => {
        pid_file => '/var/run/package-info/package-info.pid',
    },
}

which again is perl code, and specifies the location of the PID file when run under hypnotoad, the application server recommended for use with Mojolicious.

To test it, you can install the libmojolicious-perl package, and run MOJO_CONFIG=$PWD/etc/package-info.conf morbo usr/lib/package-info/package-info. This starts a development server on port 3000. Pointing your browser at http://127.0.0.1:3000/, you should see a list like this:

Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                  Version                              Architecture Description
+++-=====================================-====================================-============-===============================================================================
ii  ack-grep                              2.14-4                               all          grep-like program specifically for large source trees
ii  acl                                   2.2.52-2                             amd64        Access control list utilities
rc  acroread-debian-files                 0.2.5                                amd64        Debian specific parts of Adobe Acrobat Reader
ii  adduser                               3.113+nmu3                           all          add and remove users and groups
ii  adwaita-icon-theme                    3.14.0-2                             all          default icon theme of GNOME

though much longer.

Initial Packaging

Installing dh-make and running dh_make --createorig -p package-info_0.1 gives us a debian directory along with several files.

I started by editing debian/control to look like this:

Source: package-info
Section: main
Priority: optional
Maintainer: Moritz Lenz 
Build-Depends: debhelper (>= 9)
Standards-Version: 3.9.5

Package: package-info
Architecture: all
Depends: ${misc:Depends}, libmojolicious-perl
Description: Web service for getting a list of installed packages

Debian packages support the notion of a source package, which a maintainer uploads to the Debian build servers, and from which one or more binary packages are built. The control file reflects this structure, with the first half being about the source package and its build dependencies, and the second half being about the binary package.

Next I deleted the file debian/source/format, which by default indicates the use of the quilt patch management system, which isn't typically used in git based workflows.

I leave debian/rules, debian/compat and debian/changelog untouched, and create a file debian/install with two lines:

etc/package-info.conf
usr/lib/package-info/package-info

In lieu of a proper build system, this tells dh_install which files to copy into the debian package.

This is enough for building a Debian package. To trigger the build, this command suffices:

debuild -b -us -uc

The -b instructs debuild to only create a binary package, and the two -u* options skip the steps where debuild cryptographically signs the generated files.

This command creates three files in the directory above the source tree: package-info_0.1-1_all.deb, package-info_0.1-1_amd64.changes and package-info_0.1-1_amd64.build. The .deb file contains the actual program code and meta data, the .changes file meta data about the package as well as the last changelog entry, and the .build file a transcript of the build process.

A Little Daemonology

Installing the .deb file from the previous step would give you a working piece of software, but you'd have to start it manually.

Instead, it is useful to provide means to automatically start the server process at system boot time. Traditionally, this has been done by shipping init scripts. Since Debian transitioned to systemd as its init system with the "Jessie" / 8 version, systemd service files are the new way to go, and luckily much shorter than a robust init script.

The service file goes into debian/package-info.service:

[Unit]
Description=Package installation information via http
Requires=network.target
After=network.target

[Service]
Type=simple
RemainAfterExit=yes
SyslogIdentifier=package-info
PIDFile=/var/run/package-info/package-info.pid
Environment=MOJO_CONFIG=/etc/package-info.conf
ExecStart=/usr/bin/hypnotoad /usr/lib/package-info/package-info -f
ExecStop=/usr/bin/hypnotoad -s /usr/lib/package-info/package-info
ExecReload=/usr/bin/hypnotoad /usr/lib/package-info/package-info

The [Unit] section contains the service description, as well as the specification when it starts. The [Service] section describes the service type, where simple means that systemd expects the start command to not terminate as long as the process is running. With Environment, environment variables can be set for all three of the ExecStart, ExecStop and ExecReload commands.

Another debhelper add-on, dh-systemd, takes care of installing the service file, as well as making sure the service file is read and the service started or restarted after a package installation. To enable it, dh-systemd must be added to the Build-Depends line in the file debian/control, and the catch-all build rule in debian/rules must be changed to:

%:
        dh $@ --with systemd

To enable hypnotoad to write the PID file, the containing directory must exist. Writing /var/run/package-info/ into a new debian/dirs file ensures this directory is created at package installation.

To test the changes, again invoke debuild -b -us -uc and install the resulting .deb file with sudo dpkg -i ../package-info_0.1-1_all.deb.

The server process should now listen on port 8080, so you can test it with curl http://127.0.0.1:8080/ | head.

A Bit More Security

As it is now, the application server and the application run as the root user, which violates the Principle of least privilege. Instead it should run as a separate user, package-info, that isn't allowed to do much else.

To make the installation as smooth as possible, the package should create the user itself if it doesn't exist. The debian/postinst script is run at package installation time, and is well suited for such tasks:

#!/bin/sh

set -e
test $DEBIAN_SCRIPT_DEBUG && set -v -x

export PATH=$PATH:/sbin:/usr/sbin:/bin:/usr/bin

USER="package-info"

case "$1" in
    configure)
        if ! getent passwd $USER >/dev/null ; then
            adduser --system $USER
        fi
        chown -R $USER /var/run/package-info/
    ;;
esac

#DEBHELPER#

exit 0

There are several actions that a postinst script can execute, and configure is the right one for creating users. At this time, the files are already installed.

Note that it also changes the permissions for the directory in which the PID file is created, so that when hypnotoad is invoked as the package-info user, it can still create the PID file.

Please note the presence of the #DEBHELPER# tag, which the build system replaces with extra actions. Some of these come from dh-systemd, and take care of restarting the service after installation, and enabling it for starting after a reboot on first installation.

To set the user under which the service runs, add the line User=package-info to the [Service] section of debian/package-info.service.
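
With that addition, the relevant part of the [Service] section might look like this (abbreviated):

[Service]
Type=simple
User=package-info
PIDFile=/var/run/package-info/package-info.pid
ExecStart=/usr/bin/hypnotoad /usr/lib/package-info/package-info -f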

Linux offers more security features that can be enabled in a declarative way in the [Service] section of the systemd service file. Here are a few that protect the rest of the system from the server process, should it be exploited:

PrivateTmp=yes
InaccessibleDirectories=/home
ReadOnlyDirectories=/bin /sbin /usr /lib /etc

Additional precautions can be taken by limiting the number of processes that can be spawned and the available memory through the LimitNPROC and MemoryLimit options.

The importance of good packaging

If you tune your packages so that they do as much configuration and environment setup as possible themselves, you benefit twice. It makes it easy to install the package in any context, regardless of whether it is embedded in a deployment system. And even if it is part of a deployment system, putting the package-specific bits into the package itself helps you keep the deployment system generic, and thus easy to extend to other packages.

For example configuration management systems such as Ansible, Chef and Puppet allow you to create users and to restart services when a new package version is available, but if you rely on that, you have to treat each package separately in the configuration management system.


I'm writing a book on automating deployments. If this topic interests you, please sign up for the Automating Deployments newsletter. It will keep you informed about automating and continuous deployments. It also helps me to gauge interest in this project, and your feedback can shape the course it takes.


Dave's Free Press: Journal: Palm Treo call db module

Perlgeek.de : Automating Deployments: Installing Packages

After the long build-up of building and distributing and authenticating packages, actually installing them is easy. On the target system, run

$ apt-get update
$ apt-get install package-info

(replace package-info with the package you want to install, if that deviates from the example used previously).

If the package is of high quality, it takes care of restarting services where necessary, so no additional actions are necessary afterwards.

Coordination with Ansible

If several hosts are needed to provide a service, it can be beneficial to coordinate the update, for example only updating one or two hosts at a time, or doing a small integration test on each before moving on to the next.

A nice tool for doing that is Ansible, an open source IT automation system.

Ansible's starting point is an inventory file, which lists the hosts that Ansible works with, optionally in groups, and how to access them.

It is best practice to have one inventory file for each environment (production, staging, development, load testing etc.) with the same group names, so that you can deploy to a different environment simply by using a different inventory file.

Here is an example for an inventory file with two web servers and a database server:

# production
[web]
www01.yourorg.com
www02.yourorg.com

[database]
db01.yourorg.com

[all:vars]
ansible_ssh_user=root

Maybe the staging environment needs only a single web server:

# staging
[web]
www01.staging.yourorg.com

[database]
db01.staging.yourorg.com

[all:vars]
ansible_ssh_user=root

Ansible is organized in modules for separate tasks. Managing Debian packages is done with the apt module:

$ ansible -i staging web -m apt -a 'name=package-info update_cache=yes state=latest'

The -i option specifies the path to the inventory file, here staging. The next argument is the group of hosts (or a single host, if desired), and -m apt tells Ansible to use the apt module.

What comes after the -a is a module-specific command. name specifies a Debian package, update_cache=yes forces Ansible to run apt-get update before installing the latest version, and state=latest says that the latest available version is what we want installed.

If instead of the latest version we want a specific version, -a 'name=package-info=0.1 update_cache=yes state=present force=yes' is the way to go. Without force=yes, apt wouldn't downgrade the package to actually get the desired version.
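
To coordinate such an update across hosts one at a time, as mentioned at the start of this section, a small loop around the ad-hoc command can serve as a first approximation. This is only a sketch; the host names, the production inventory file and the port 8080 smoke test are assumptions carried over from the earlier examples:

#!/bin/bash
set -e
for host in www01.yourorg.com www02.yourorg.com; do
    ansible -i production "$host" -m apt \
        -a 'name=package-info update_cache=yes state=latest'
    # stop the rollout if the freshly updated host doesn't answer
    curl --fail --silent "http://$host:8080/" > /dev/null || exit 1
done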

This uses the ad-hoc mode of Ansible. More sophisticated deployments use playbooks, of which I hope to write more later. Those also allow you to do configuration tasks such as adding repository URLs and GPG keys for package authentication.


I'm writing a book on automating deployments. If this topic interests you, please sign up for the Automating Deployments newsletter. It will keep you informed about automating and continuous deployments. It also helps me to gauge interest in this project, and your feedback can shape the course it takes.


Perlgeek.de : Automating Deployments: Building in the Pipeline

The first step of an automated deployment system is always the build. (For software that doesn't need a build to be tested, the tests might come first, but stay with me nonetheless.)

At this point, I assume that there is already a build system in place that produces packages in the desired format, here .deb files. Here I will talk about integrating this build step into a pipeline that automatically polls new versions from a git repository, runs the build, and records the resulting .deb package as a build artifact.

A GoCD Build Pipeline

As mentioned earlier, my tool of choice for controlling the pipeline is Go Continuous Delivery (GoCD). Once you have it installed and have configured an agent, you can start to create a pipeline.

GoCD lets you build pipelines in its web interface, which is great for exploring the available options. But for a blog entry, it's easier to look at the resulting XML configuration, which you can also enter directly ("Admin" → "Config XML").

So without further ado, here's the first draft:


  <pipelines group="deployment">
    <pipeline name="package-info">
      <materials>
        <git url="https://github.com/moritz/package-info.git" dest="package-info" />
      </materials>
      <stage name="build" cleanWorkingDir="true">
        <jobs>
          <job name="build-deb" timeout="5">
            <tasks>
              <exec command="/bin/bash" workingdir="package-info">
                <arg>-c</arg>
                <arg>debuild -b -us -uc</arg>
              </exec>
            </tasks>
            <artifacts>
              <artifact src="package-info*_*" dest="package-info/" />
            </artifacts>
          </job>
        </jobs>
      </stage>
    </pipeline>
  </pipelines>

The outer-most group is a pipeline group, which has a name. It can be used to make it easier to get an overview of available pipelines, and also to manage permissions. Not very interesting for now.

The second level is the <pipeline> with a name, and it contains a list of materials and one or more stages.

Materials

A material is anything that can trigger a pipeline, and/or provide files that commands in a pipeline can work with. Here the only material is a git repository, which GoCD happily polls for us. When it detects a new commit, it triggers the first stage in the pipeline.

Directory Layout

Each time a job within a stage is run, the go agent (think worker) which runs it prepares a directory in which it makes the materials available. On Linux, this directory defaults to /var/lib/go-agent/pipelines/$pipeline_name. Paths in the GoCD configuration are typically relative to this path.

For example the material definition above contains the attribute dest="package-info", so the absolute path to this git repository is /var/lib/go-agent/pipelines/package-info/package-info. Leaving out the dest="..." works, and gives one less level of directory, but only works for a single material. It is a rather shaky assumption that you won't need a second material, so don't do that.

See the config references for a list of available material types and options. Plugins are available that add further material types.

Stages

All the stages in a pipeline run serially, and each one only if the previous stage succeeds. A stage has a name, which is used both in the front end, and for fetching artifacts.

In the example above, I gave the stage the attribute cleanWorkingDir="true", which makes GoCD delete files created during the previous build, and discard changes to files under version control. This tends to be a good option to use, otherwise you might unknowingly slide into a situation where a previous build affects the current build, which can be really painful to debug.

Jobs, Tasks and Artifacts

Jobs are potentially executed in parallel within a stage, and have names for the same reasons that stages do.

Inside a job there can be one or more tasks. Tasks are executed serially within a job. I tend to mostly use <exec> tasks (and <fetchartifact>, which I will cover in a later blog post), which invoke system commands. They follow the UNIX convention of treating an exit status of zero as success, and everything else as a failure.

For more complex commands, I create shell or Perl scripts inside a git repository, and add that repository as a material to the pipeline, which makes them available during the build process with no extra effort.

The <exec> task in our example invokes /bin/bash -c 'debuild -b -us -uc', which is a case of cargo cult programming, because invoking debuild directly works just as well. Ah well, I will revise that later.

debuild -b -us -uc builds the Debian package, and is executed inside the git checkout of the source. It produces a .deb file, a .changes file and possibly a few other files with meta data. They are created one level above the git checkout, so in the root directory of the pipeline run.

These are the files that we want to work with later on, so we let GoCD store them in an internal database. That's what the <artifact> element instructs GoCD to do.

Since the name of the generated files depends on the version number of the built Debian package (which comes from the debian/changelog file in the git repo), it's not easy to reference them by name later on. That's where the dest="package-info/" comes into play: it makes GoCD store the artifacts in a directory with a fixed name. Later stages can then retrieve all artifact files from this directory by the fixed name.

The Pipeline in Action

If nothing goes wrong (and nothing ever does, right?), this is roughly what the web interface looks like after running the new pipeline:

So, whenever there is a new commit in the git repository, GoCD happily builds a Debian package and stores it for further use. Automated builds, yay!

But there is a slight snag: It recycles version numbers, which other Debian tools are very unhappy about. In the next blog post, I'll discuss a way to deal with that.


I'm writing a book on automating deployments. If this topic interests you, please sign up for the Automating Deployments newsletter. It will keep you informed about automating and continuous deployments. It also helps me to gauge interest in this project, and your feedback can shape the course it takes.


Perlgeek.de : Architecture of a Deployment System

An automated build and deployment system is structured as a pipeline.

A new commit or branch in a version control system triggers the instantiation of the pipeline, and starts executing the first of a series of stages. When a stage succeeds, it triggers the next one. If it fails, the entire pipeline instance stops.

Then manual intervention is necessary, typically by adding a new commit that fixes code or tests, or by fixing things with the environment or the pipeline configuration. A new instance of the pipeline then has a chance to succeed.

Deviations from the strict pipeline model are possible: branches, potentially executed in parallel, allow for example running different tests in different environments, with the next step waiting until both have completed successfully.

The typical stages are building, running the unit tests, deployment to a first test environment, running integration tests there, potentially deployment to and tests in various test environments, and finally deployment to production.

Sometimes, these stages blur a bit. For example, a typical build of Debian packages also runs the unit tests, which alleviates the need for a separate unit testing stage. Likewise if the deployment to an environment runs integration tests for each host it deploys to, there is no need for a separate integration test stage.

Typically there is a piece of software that controls the flow of the whole pipeline. It prepares the environment for a stage, runs the code associated with the stage, collects its output and artifacts (that is, files that the stage produces and that are worth keeping, like binaries or test output), determines whether the stage was successful, and then proceeds to the next.

From an architectural standpoint, it relieves the stages of having to know what stage comes next, and even how to reach the machine on which it runs. So it decouples the stages.

Anti-Pattern: Separate Builds per Environment

If you use a branch model like git flow for your source code, it is tempting to automatically deploy the develop branch to the testing environment, and then make releases, merge them into the master branch, and deploy that to the production environment.

It is tempting because it is a straight-forward extension of an existing, proven workflow.

Don't do it.

The big problem with this approach is that you don't actually test what's going to be deployed, and on the flip side, deploy something untested to production. Even if you have a staging environment before deploying to production, you are invalidating all the testing you did in the testing environment if you don't actually ship the binary or package that you tested there.

If you build "testing" and "release" packages from different sources (like different branches), the resulting binaries will differ. Even if you use the exact same source, building twice is still a bad idea, because many builds aren't reproducible. Non-deterministic compiler behavior, differences in environments and dependencies all can lead to packages that worked fine in one build, and failed in another.

It is best to avoid such potential differences and errors by deploying to production exactly the same build that you tested in the testing environment.

Differences in behavior between the environments, where they are desirable, should be implemented by configuration that is not part of the build. (It should be self-evident that the configuration should still be under version control, and also automatically deployed. There are tools that specialize in deploying configuration, like Puppet, Chef and Ansible.)


I'm writing a book on automating deployments. If this topic interests you, please sign up for the Automating Deployments newsletter. It will keep you informed about automating and continuous deployments. It also helps me to gauge interest in this project, and your feedback can shape the course it takes.


Dave's Free Press: Journal: Travelling in time: the CP2000AN

Dave's Free Press: Journal: Graphing tool

Dave's Free Press: Journal: XML::Tiny released

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 1

Dave's Free Press: Journal: Thanks, Yahoo!

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 2

Ocean of Awareness: What are the reasonable computer languages?

"You see things; and you say 'Why?' But I dream things that never were; and I say 'Why not?'" -- George Bernard Shaw

In the 1960's and 1970's computer languages were evolving rapidly. It was not clear which way they were headed. Would most programming be done with general-purpose languages? Or would programmers create a language for every task domain? Or even for every project? And, if lots of languages were going to be created, what kinds of languages would be needed?

It was in that context that Čulik and Cohen, in a 1973 paper, outlined what they thought programmers would want and should have. In keeping with the spirit of the time, it was quite a lot:

  • Programmers would want to extend their grammars with new syntax, including new kinds of expressions.
  • Programmers would also want to use tools that automatically generated new syntax.
  • Programmers would not want to, and especially in the case of auto-generated syntax would usually not be able to, massage the syntax into very restricted forms. Instead, programmers would create grammars and languages which required unlimited lookahead to disambiguate, and they would require parsers which could handle these grammars.
  • Finally, programmers would need to be able to rely on all of this parsing being done in linear time.

Today, we think we know that Čulik and Cohen's vision was naive, because we think we know that parsing technology cannot support it. We think we know that parsing is much harder than they thought.

The eyeball grammars

As a thought problem, consider the "eyeball" class of grammars. The "eyeball" class of grammars contains all the grammars that a human can parse at a glance. If a grammar is in the eyeball class, but a computer cannot parse it, it presents an interesting choice. Either,

  • your computer is not using the strongest practical algorithm; or
  • your mind is using some power which cannot be reduced to a machine computation.

There are some people out there (I am one of them) who don't believe that everything the mind can do reduces to a machine computation. But even those people will tend to go for the first choice in this case: There must be some practical computer parsing algorithm which can do at least as well at parsing as a human can do by "eyeball". In other words, the class of "reasonable grammars" should contain the eyeball class.

Čulik and Cohen's candidate for the class of "reasonable grammars" were the grammars that a deterministic parse engine could parse if it had a lookahead that was infinite, but restricted to distinguishing between regular expressions. They called these the LR-regular, or LRR, grammars. And the LRR grammars do in fact seem to be a good first approximation to the eyeball class. They do not allow lookahead that contains things that you have to count, like palindromes. And, while I'd be hard put to eyeball every possible string for every possible regular expression, intuitively the concept of scanning for a regular expression does seem close to capturing the idea of glancing through a text looking for a telltale pattern.

So what happened?

Alas, the algorithm in the Čulik and Cohen paper turned out to be impractical. But in 1991, Joop Leo discovered a way to adapt Earley's algorithm to parse the LRR grammars in linear time, without doing the lookahead. And Leo's algorithm does have a practical implementation: Marpa.

References, comments, etc.

To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Perlgeek.de : Technology for automating deployments: the agony of choice

As an interlude I'd like to look at alternative technology stacks that you could use in your deployment project. I'm using a certain stack because it makes sense in the context of the bigger environment.

This is a mostly Debian-based infrastructure with its own operations team, and software written in various (mostly dynamic) programming languages.

If your organization writes only Java code (or code in programming languages that are based on the JVM), and your operations folks are used to that, it might be a good idea to ship .jar files instead. Then you need a tool that can deploy them, and a repository to store them.

Package format

I'm a big fan of operating system packages, for three reasons: The operators are familiar with them, they are language agnostic, and configuration management software typically supports them out of the box.

If you develop applications in several different programming languages, say perl, python and ruby, it doesn't make sense to build a deployment pipeline around three different software stacks and educate everybody involved about how to use and debug each of the language-specific package managers. It is much more economical to have the maintainer of each application build a system package, and then use one toolchain to deploy that.

That doesn't necessarily imply building a system package for each upstream package. Fat-packaging is a valid way to avoid an explosion of packaging tasks, and also to avoid clashes when dependencies on conflicting versions of the same package exist. dh-virtualenv works well for packaging python software and all its python dependencies into a single Debian package; only the python interpreter itself needs to be installed on the target machine.

If you need to deploy to multiple operating system families and want to build only one package, nix is an interesting approach, with the additional benefit of allowing parallel installation of several versions of the same package. That can be useful for running two versions in parallel, and only switching over to the new one for good when you're convinced that there are no regressions.

Repository

The choice of package format dictates the repository format. Debian packages are stored in a different structure than Pypi packages, for example. For each repository format there is tooling available to help you create and update the repository.

For Debian, aptly is my go-to solution for repository management. Reprepro seems to be a decent alternative.

Pulp is a rather general and scalable repository management software that was originally written for RPM packages, but now also supports Debian packages, Python (pypi) packages and more. Compared to the other solutions mentioned so far (which are just command line programs that you run when you need something, and that use the file system as storage), it comes with some administrative overhead, because there's at least a MongoDB database and a RabbitMQ message broker required to run it. But when you need such a solution, it's worth it.

A smaller repository management tool for Python is pip2pi. In its simplest form you just copy a few .tar.gz files into a directory, run dir2pi . in that directory, and make it accessible through a web server.
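
In terms of commands, and with made-up paths and package names, that could be as little as:

$ mkdir -p /srv/pypi
$ cp Foo-1.0.tar.gz /srv/pypi/
$ cd /srv/pypi && dir2pi .

Once the directory is served over HTTP, pip can then use the generated index via its --index-url option.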

CPAN-Mini is a good and simple tool for providing a CPAN mirror, and CPAN-Mini-Inject lets you add your own Perl modules, in the simplest case through the mcpani script.

Installation

Installing a package and its dependencies often looks easy on the surface, something like apt-get update && apt-get install $package. But that is deceptive, because many installers are interactive by nature, require special flags to force installation of an older version, or have other potential pitfalls.
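
For unattended use it therefore pays to be explicit. A sketch for installing a pinned version non-interactively (the version number is just an example):

$ export DEBIAN_FRONTEND=noninteractive
$ apt-get update
$ apt-get install -y package-info=0.1-1

Older apt versions may additionally need --force-yes when such a pinned install amounts to a downgrade.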

Ansible provides modules for installing .deb packages, python modules, perl, RPM through yum, nix packages and many others. It also requires little up-front configuration on the destination system and is very friendly for beginners, but still offers enough power for more complex deployment tasks. It can also handle configuration management.

An alternative is Rex, with which I have no practical experience.

Not all configuration management systems are good fits for managing deployments. For example Puppet doesn't seem to have a good way to provide an order for package upgrades ("first update the backend on servers bck01 and bck02, and then frontend on www01, and the rest of the backend servers").


I'm writing a book on automating deployments. If this topic interests you, please sign up for the Automating Deployments newsletter. It will keep you informed about automating and continuous deployments. It also helps me to gauge interest in this project, and your feedback can shape the course it takes.


Perlgeek.de : Automating Deployments: New Website, Community

No IT project would be complete without its own website, so my project of writing a book on automated deployments now has one too. Please visit deploybook.com and tell me what you think about it. The plan is to gradually add some content, and maybe also a bit of color. When the book finally comes out (don't hold your breath here, it'll take some more months), it'll also be for sale there.

Quite a few readers have emailed me, sharing their own thoughts on the topic of building and deploying software, and often asking for feedback. There seems to be a need for a place to share these things, so I created one. The deployment community is a Discourse forum dedicated to discussing such things. I expect to see a low volume of posts, but valuable ones to those involved.

To conclude the news roundup for this week, I'd like to mention that next week I'll be giving a talk on continuous delivery at the German Perl Workshop 2016, where I'm also one of the organizers. After the workshop is over, slides will be available online, and I'll likely have more time again for blogging. So stay tuned!


I'm writing a book on automating deployments. If this topic interests you, please sign up for the Automating Deployments newsletter. It will keep you informed about automating and continuous deployments. It also helps me to gauge interest in this project, and your feedback can shape the course it takes.


Dave's Free Press: Journal: YAPC::Europe 2007 travel plans

Dave's Free Press: Journal: Wikipedia handheld proxy

Perlgeek.de : Automating Deployments:

In a previous installment we saw a GoCD configuration that automatically built a Debian package from a git repository whenever somebody pushes a new commit to the git repo.

The version of the generated Debian package comes from the debian/changelog file of the git repository. Which means that whenever somebody pushes code or doc changes without a new changelog entry, the resulting Debian package has the same version number as the previous one.

The problem with this version recycling is that most Debian tooling assumes that the tuple of package name, version and architecture uniquely identifies a revision of a package. So stuffing a new version of a package with an old version number into a repository is bound to cause trouble; most repository management software simply refuses to accept that. On the target machine, upgrading the package won't do anything if the version number stays the same.

So, it's a good idea to put a bit more thought into the version string of the automatically built Debian package.

Constructing Unique Version Numbers

There are several sources that you can tap to generate unique version numbers:

  • Randomness (for example in the form of UUIDs)
  • The current date and time
  • The git repository itself
  • GoCD exposes several environment variables that can be of use

The latter is quite promising: GO_PIPELINE_COUNTER is a monotonic counter that increases each time GoCD runs the pipeline, so a good source for a version number. GoCD allows manual re-running of stages, so it's best to combine it with GO_STAGE_COUNTER. In terms of shell scripting, using $GO_PIPELINE_COUNTER.$GO_STAGE_COUNTER as a version string sounds like a decent approach.

But, there's more. GoCD allows you to trigger a pipeline with a specific version of a material, so you can have a new pipeline run to build an old version of the software. If you do that, using GO_PIPELINE_COUNTER as the first part of the version string doesn't reflect the use of an old code base.

To construct a version string that primarily reflects the version of the git repository, and only secondarily the build iteration, the first part of the version string has to come from git. As a distributed version control system, git doesn't supply a single, numeric version counter. But if you limit yourself to a single repository and branch, you can simply count commits.

git describe is an established way to count commits. By default it prints the last tag in the repo, and if HEAD does not resolve to the same commit as the tag, it adds the number of commits since that tag, and the abbreviated sha1 hash prefixed by g, so for example 2016.04-32-g4232204 for the commit 4232204, which is 32 commits after the tag 2016.04. The option --long forces it to always print the number of commits and the hash, even when HEAD points to a tag.

We don't need the commit hash for the version number, so a shell script to construct a good version number looks like this:

#!/bin/bash

set -e
set -o pipefail
version=$(git describe --long |sed 's/-g[A-Fa-f0-9]*$//')
version="$version.${GO_PIPELINE_COUNTER:-0}.${GO_STAGE_COUNTER:-0}"

Bash's ${VARIABLE:-default} syntax is a good way to make the script work outside a GoCD agent environment.

This script requires a tag to be set in the git repository. If there is none, it fails with this message from git describe:

fatal: No names found, cannot describe anything.

Other Bits and Pieces Around the Build

Now that we have a version string, we need to instruct the build system to use this version string. This works by writing a new entry in debian/changelog with the desired version number. The debchange tool automates this for us. A few options are necessary to make it work reliably:

export DEBFULLNAME='Go Debian Build Agent'
export DEBEMAIL='go-noreply@example.com'
debchange --newversion=$version  --force-distribution -b  \
    --distribution="${DISTRIBUTION:-jessie}" 'New Version'

When we want to reference this version number in later stages in the pipeline (yes, there will be more), it's handy to have it available in a file. It is also handy to have it in the output, so two more lines to the script:

echo $version
echo $version > ../version

And of course, trigger the actual build:

debuild -b -us -uc

Plugging It Into GoCD

To make the script accessible to GoCD, and also have it under version control, I put it into a git repository under the name debian-autobuild and added the repo as a material to the pipeline:

<pipeline name="package-info">
  <materials>
    <git url="https://github.com/moritz/package-info.git" dest="package-info" />
    <git url="https://github.com/moritz/deployment-utils.git" dest="deployment-utils" materialName="deployment-utils" />
  </materials>
  <stage name="build" cleanWorkingDir="true">
    <jobs>
      <job name="build-deb" timeout="5">
        <tasks>
          <exec command="../deployment-utils/debian-autobuild" workingdir="#{package}" />
        </tasks>
        <artifacts>
          <artifact src="version" />
          <artifact src="package-info*_*" dest="package-info/" />
        </artifacts>
      </job>
    </jobs>
  </stage>
</pipeline>

Now GoCD automatically builds Debian packages on each commit to the git repository, and gives each a distinct version string.

The next step is to add it to a repository, so that it can be installed on a target machine with a simple apt-get command.


I'm writing a book on automating deployments. If this topic interests you, please sign up for the Automating Deployments newsletter. It will keep you informed about automating and continuous deployments. It also helps me to gauge interest in this project, and your feedback can shape the course it takes.

Subscribe to the Automating Deployments mailing list

* indicates required

Dave's Free Press: Journal: Bryar security hole

Dave's Free Press: Journal: POD includes

Dave's Free Press: Journal: cgit syntax highlighting

Ocean of Awareness: Grammar reuse

Every year the Perl 6 community creates an "Advent" series of posts. I always follow these, but one in particular caught my attention this year. It presents a vision of a future where programming is language-driven. A vision that I share. The post went on to encourage its readers to follow up on this vision, and suggested an approach. But I do not think the particular approach suggested would be fruitful. In this post I'll explain why.

Reuse

The focus of the Advent post was language-driven programming, and that is the aspect that excites me most. But the points that I wish to make are more easily understood if I root them in a narrower, but more familiar issue -- grammar reuse.

Most programmers will be very familiar with grammar reuse from regular expressions. In the regular expression ("RE") world, programming by cutting and pasting is very practical and often practiced.

For this post I will consider grammar reusability to be the ability to join two grammars and create a third. This is also sometimes called grammar composition. For this purpose, I will widen the term "grammar" to include RE's and PEG parser specifications. Ideally, when you compose two grammars, what you get is

  • a language you can reasonably predict, and
  • if each of the two original grammars can be parsed in reasonable time, a language that can be parsed in reasonable time.

Not all language representations are reusable. RE's are, and BNF is. PEG looks like a combination of BNF and RE's, but PEG, in fact, is its own very special form of parser specification. And PEG parser specifications are one of the least reusable language representations ever invented.

Reuse and regular expressions

RE's are as well-behaved under reuse as a language representation can get. The combination of two RE's is always another RE, and you can reasonably determine what language the combined RE recognizes by examining it. Further, every RE is parseable in linear time.

The one downside, often mentioned by critics, is that RE's do not scale in terms of readability. Here, however, the problem is not really one of reusability. The problem is that RE's are quite limited in their capabilities, and programmers often exploit the excellent behavior of RE's under reuse to push them into applications for which RE's just do not have the power.

Reuse and PEG

When programmers first look at PEG syntax, they often think they've encountered paradise. They see both BNF and RE's, and imagine they'll have the best of each. But the convenient behavior of RE's depends on their unambiguity. You simply cannot write an ambiguous RE -- it's impossible.

More powerful and more flexible, BNF allows you to describe many more grammars -- including ambiguous ones. How does PEG resolve this? With a Gordian knot approach. Whenever it encounters an ambiguity, it throws all but one of the choices away. The author of the PEG specification gets some control over what is thrown away -- he specifies an order of preference for the choices. But the degree of control is less than it seems, and in practice PEG is the nitroglycerin of parsing -- marvelous when it works, but tricky and dangerous.

Consider these 3 PEG specifications:

	("a"|"aa")"a"
	("aa"|"a")"a"
	A = "a"A"a"/"aa"

All three clearly accept only strings which are repetitions of the letter "a". But which strings? For the answers, suggestions for dealing with PEG if you are committed to it, and more, look at my previous post on PEG.

When getting an RE or a BNF grammar to work, you can go back to the grammar and ask yourself "Does my grammar look like my intended language?". With PEG, this is not really possible. With practice, you might get used to figuring out single-line PEG specs like the first two above. (If you can get the last one, you're amazing.) But tracing these through the multiple rule layers required by useful grammars is, in practice, not really possible.

In real life, PEG specifications are written by hacking them until the test suite works. And, once you get a PEG specification to pass the test suite for a practical-sized grammar, you are very happy to leave it alone. Trying to compose two PEG specifications is rolling the dice with the odds against you.

Reuse and the native Perl 6 parser

The native Perl 6 parser is an extended PEG parser. The extensions are very interesting from the PEG point of view. The PEG "tie breaking" has been changed, and backtracking can be used. These features mean the Perl 6 parser can be extended to languages well beyond what ordinary PEG parsers can handle. But, if you use the extra features, reuse will be even trickier than if you stuck with vanilla PEG.

Reuse and general BNF parsing

As mentioned, general BNF is reusable, and so general BNF parsers like Marpa are as reusable as regular expressions, with two caveats. First, if the two grammars are not doing their own lexing, their lexers will have to be compatible.

Second, with regular expressions you had the advantage that every regular expression parses in linear time, so that speed was guaranteed to be acceptable. Marpa users reuse grammars and pieces of grammars all the time. The result is always the language specified by the merged BNF, and I've never heard anyone complain that performance deteriorated.

But, while it may not happen often, it is possible to combine two Marpa grammars that run in linear time and end up with one that does not. You can guarantee your merged Marpa grammar will stay linear if you follow 2 rules:

  • keep the grammar unambiguous;
  • don't use an unmarked middle recursion.

Unmarked middle recursions are not things you're likely to need a lot: they are those palindromes where you have to count to find the middle: grammars like "A ::= a | a A a". If you use a middle recursion at all, it is almost certainly going to be marked, like "A ::= b | a A a", which generates strings like "aabaa". With Marpa, as with RE's, reuse is easy and practical. And, as I hope to show in a future post, unlike RE's, Marpa opens the road to language-driven programming.

Perl 6

I'm a fan of the Perl 6 effort. I certainly should be a supporter, after the many favors they've done for me and the Marpa community over the years. The considerations of this post will disappoint some of the hopes for applications of the native Perl 6 parser. But these applications have not been central to the Perl 6 effort, of which I will be an eager student over the coming months.

Comments

To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Perlgeek.de : Automating Deployments: Building Debian Packages

I have argued before that it is a good idea to build packages from software you want to automatically deploy. The package manager gives you dependency management as well as the option to execute code at defined points in the installation process, which is very handy for restarting services after installation, creating necessary OS-level users and so on.

Which package format to use?

There are many possible package formats and managers for them out there. Many ecosystems and programming languages come with their own: Perl uses CPAN.pm or cpanminus to install Perl modules, the NodeJS community uses npm, Ruby has the gem installer, Python has pip and easy_install, and so on.

One of the disadvantages is that they only work well for one language. If you or your company uses software written in multiple programming languages, and you choose to use the language-specific packaging formats and tools for each, you burden yourself and the operators with having to know (and be aware of) all of these technologies.

Operations teams are usually familiar with the operating system's package manager, so using that seems like an obvious choice, especially if the same operating system family is used throughout the whole organization. In specialized environments, other solutions might be preferable.

What's in a Debian package, and how do I build one?

A .deb file is an ar archive with meta data about the archive format version, meta data for the package (name, version, installation scripts) and the files that are to be installed.
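
If you are curious, such an archive can be inspected directly with standard tools. Here is a quick sketch, with a purely illustrative file name (the exact member names and compression suffixes vary between packages):

$ ar t package-info_0.1-1_all.deb                 # list the members of the ar archive
debian-binary
control.tar.gz
data.tar.xz
$ dpkg-deb --info package-info_0.1-1_all.deb      # show the package metadata
$ dpkg-deb --contents package-info_0.1-1_all.deb  # list the files to be installed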

While it is possible to build such a package directly, the easier and much more common route is to use the tooling provided by the devscripts package. These tools expect the existence of a debian/ directory with various files in them.

debian/control contains information such as the package name, dependencies, maintainer and description. debian/rules is a makefile that controls the build process of the debian package. debian/changelog contains a human-readable summary of changes to the package. The top-most changelog entry determines the resulting version of the package.

You can use dh_make from the dh-make package to generate a skeleton of files for the debian/ directory, which you can then edit to your liking. This will ask you for the architecture of the package. You can use a specific one like amd64, or the word any for packages that can be built on any architecture. If the resulting package is architecture independent (as is the case for many scripting languages), using all as the architecture is appropriate.
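
A typical invocation, run from the top-level directory of the source tree, might look like this sketch; the package name and version are placeholders, and dh_make(8) documents the full option list:

$ dh_make -p package-info_0.1 --single --createorig

Here --single requests a single binary package, and --createorig creates the .orig tarball that the Debian tooling expects to exist.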

Build process of a Debian package

If you use dh_make to create a skeleton, debian/rules mostly consists of a catch-all rule that calls dh $@. This is a tool that tries to do the right thing for each build step automatically, and usually it succeeds. If there is a Makefile in your top-level directory, it will call the configure, build, check and install make targets for you. If your build system installs into the DESTDIR prefix (which is set to debian/your-package-name), it should pretty much work out of the box.
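
If you want to see which individual debhelper commands dh would run for a given target without actually executing them, it can tell you. A rough sketch -- the exact list depends on your debhelper version and compat level, and is abbreviated here:

$ dh binary --no-act
   dh_testdir
   dh_auto_configure
   dh_auto_build
   dh_auto_test
   ...
   dh_auto_install
   ...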

If you want to copy additional files into the Debian package, list the file names, one per line, in debian/install, and they will be copied into the package automatically for you.
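
For example, a debian/install along these lines (the paths are purely illustrative) would ship a configuration file and a helper script with the package:

$ cat debian/install
config/app.conf        etc/yourcompany/
scripts/cleanup.sh     usr/share/yourcompany/bin/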

Shortcuts

If you have already packaged your code for distribution through language-specific tools, such as CPAN (Perl) or pip (Python), there are shortcuts to creating Debian Packages.

Perl

The tool dh-make-perl (installable via the package of the same name) can automatically create a debian directory based on the perl-specific packaging. Calling dh-make-perl . inside the root directory of your perl source tree is often enough to create a functional Debian package. It sticks to the naming convention that a Perl package Awesome::Module becomes libawesome-module-perl in Debian land.
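
The whole round trip can then be as short as this sketch, with Awesome-Module standing in for your distribution directory:

$ cd Awesome-Module
$ dh-make-perl .        # generate debian/ from the distribution's metadata
$ debuild -b -us -uc    # build an unsigned binary package, e.g. libawesome-module-perl_*_all.deb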

Python

py2dsc from the python-stdeb package generates a debian/ directory from an existing python tarball.
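
A minimal invocation might look like the following sketch (the tarball name is a placeholder); py2dsc unpacks the result into a deb_dist/ directory, from which the package can be built as usual:

$ py2dsc dist/yourpackage-0.1.0.tar.gz
$ cd deb_dist/yourpackage-0.1.0
$ debuild -b -us -uc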

Another approach is to use dh-virtualenv. This copies all of the python dependencies into a virtualenv, so the resulting package only depends on the system python and possibly some C libraries that the python packages use; all python-level dependencies are baked in. This tends to produce bigger packages with fewer dependencies, and allows you to run several python programs on a single server, even if they depend on different versions of the same python library.

dh-virtualenv has an unfortunate choice of default installation prefix that clashes with some assumptions that Debian's python packages make. You can override that choice in debian/rules:

#!/usr/bin/make -f
export DH_VIRTUALENV_INSTALL_ROOT=/usr/share/yourcompany
%:
        dh $@ --with python-virtualenv --with systemd

It also assumes Python 2 by default. For a Python 3 based project, add these lines:

override_dh_virtualenv:
        dh_virtualenv --python=/usr/bin/python3

(As always with Makefiles, be sure to indent with hard tabulator characters, not with spaces).


I'm writing a book on automating deployments. If this topic interests you, please sign up for the Automating Deployments newsletter. It will keep you informed about automating and continuous deployments. It also helps me to gauge interest in this project, and your feedback can shape the course it takes.

Subscribe to the Automating Deployments mailing list

* indicates required

Dave's Free Press: Journal: CPAN Testers' CPAN author FAQ

Perlgeek.de : Automating Deployments: Version Recycling Considered Harmful

In the previous installment we saw a GoCD configuration that automatically built a Debian package from a git repository whenever somebody pushes a new commit to the git repo.

The version of the generated Debian package comes from the debian/changelog file of the git repository. Which means that whenever somebody pushes code or doc changes without a new changelog entry, the resulting Debian package has the same version number as the previous one.

The problem with this version recycling is that most Debian tooling assumes that the tuple of package name, version and architecture uniquely identifies a revision of a package. So stuffing a new version of a package with an old version number into a repository is bound to cause trouble; most repository management software simply refuses to accept that. On the target machine, upgrading the package won't do anything if the version number stays the same.

So it's a good idea to put a bit more thought into the version string of the automatically built Debian package.

Constructing Unique Version Numbers

There are several sources that you can tap to generate unique version numbers:

  • Randomness (for example in the form of UUIDs)
  • The current date and time
  • The git repository itself
  • GoCD exposes several environment variables that can be of use

The last of these is quite promising: GO_PIPELINE_COUNTER is a monotonic counter that increases each time GoCD runs the pipeline, making it a good source for a version number. GoCD allows manual re-running of stages, so it's best to combine it with GO_STAGE_COUNTER. In terms of shell scripting, using $GO_PIPELINE_COUNTER.$GO_STAGE_COUNTER as a version string sounds like a decent approach.

But, there's more. GoCD allows you to trigger a pipeline with a specific version of a material, so you can have a new pipeline run to build an old version of the software. If you do that, using GO_PIPELINE_COUNTER as the first part of the version string doesn't reflect the use of an old code base.

To construct a version string that primarily reflects the version of the git repository, and only secondarily the build iteration, the first part of the version string has to come from git. As a distributed version control system, git doesn't supply a single, numeric version counter. But if you limit yourself to a single repository and branch, you can simply count commits.

git describe is an established way to count commits. By default it prints the last tag in the repo, and if HEAD does not resolve to the same commit as the tag, it adds the number of commits since that tag, and the abbreviated sha1 hash prefixed by g, so for example 2016.04-32-g4232204 for the commit 4232204, which is 32 commits after the tag 2016.04. The option --long forces it to always print the number of commits and the hash, even when HEAD points to a tag.
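
For illustration, here is roughly what that looks like in a repository whose last tag is 2016.04; the hashes are of course specific to the repository:

$ git describe
2016.04-32-g4232204
$ git checkout 2016.04        # HEAD now points directly at the tag
$ git describe
2016.04
$ git describe --long
2016.04-0-g1a2b3c4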

We don't need the commit hash for the version number, so a shell script to construct a good version number looks like this:

#!/bin/bash

set -e
set -o pipefail
# "git describe --long" prints "<last tag>-<commits since tag>-g<abbreviated hash>";
# the sed expression strips the "-g<hash>" part, leaving e.g. "2016.04-32"
version=$(git describe --long |sed 's/-g[A-Fa-f0-9]*$//')
# append the GoCD counters, defaulting to 0 outside a GoCD agent
version="$version.${GO_PIPELINE_COUNTER:-0}.${GO_STAGE_COUNTER:-0}"

Bash's ${VARIABLE:-default} syntax is a good way to make the script work outside a GoCD agent environment.
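
With the repository state from the example above, the resulting version strings would look something like this (the counter values are illustrative):

2016.04-32.0.0    # outside GoCD: both counters default to 0
2016.04-32.7.1    # on an agent with GO_PIPELINE_COUNTER=7 and GO_STAGE_COUNTER=1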

This script requires a tag to be set in the git repository. If there is none, it fails with this message from git describe:

fatal: No names found, cannot describe anything.
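
If your repository has no tag yet, creating and pushing an initial annotated tag fixes that; the tag name below is just an example:

$ git tag -a 2016.01 -m 'first tagged version'
$ git push origin 2016.01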

Other Bits and Pieces Around the Build

Now that we have a version string, we need to instruct the build system to use this version string. This works by writing a new entry in debian/changelog with the desired version number. The debchange tool automates this for us. A few options are necessary to make it work reliably:

# identify the author of the changelog entry
export DEBFULLNAME='Go Debian Build Agent'
export DEBEMAIL='go-noreply@example.com'
# -b accepts a version that debchange would otherwise reject (e.g. not greater
# than the current one); --force-distribution accepts a distribution name that
# is not in debchange's list of known distributions
debchange --newversion=$version  --force-distribution -b  \
    --distribution="${DISTRIBUTION:-jessie}" 'New Version'

When we want to reference this version number in later stages of the pipeline (yes, there will be more), it's handy to have it available in a file. It is also handy to have it in the output, so we add two more lines to the script:

echo $version
echo $version > ../version

And of course, trigger the actual build:

debuild -b -us -uc

Plugging It Into GoCD

To make the script accessible to GoCD, and also have it under version control, I put it into a git repository under the name debian-autobuild and added the repo as a material to the pipeline:

<pipeline name="package-info">
  <materials>
    <git url="https://github.com/moritz/package-info.git" dest="package-info" />
    <git url="https://github.com/moritz/deployment-utils.git" dest="deployment-utils" materialName="deployment-utils" />
  </materials>
  <stage name="build" cleanWorkingDir="true">
    <jobs>
      <job name="build-deb" timeout="5">
        <tasks>
          <exec command="../deployment-utils/debian-autobuild" workingdir="#{package}" />
        </tasks>
        <artifacts>
          <artifact src="version" />
          <artifact src="package-info*_*" dest="package-info/" />
        </artifacts>
      </job>
    </jobs>
  </stage>
</pipeline>

Now GoCD automatically builds Debian packages on each commit to the git repository, and gives each a distinct version string.

The next step is to add it to a repository, so that it can be installed on a target machine with a simple apt-get command.
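
Once such a repository is set up and configured on the target machine (the subject of a later installment), installation is then indeed just an apt-get invocation; the package name matches the pipeline above, the version is illustrative:

$ apt-get update
$ apt-get install package-info                     # latest available version
$ apt-get install package-info=2016.04-32.7.1      # or pin a specific version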



Ocean of Awareness: What parser do birds use?

"Here we provide, to our knowledge, the first unambiguous experimental evidence for compositional syntax in a non-human vocal system." -- "Experimental evidence for compositional syntax in bird calls", Toshitaka N. Suzuki, David Wheatcroft & Michael Griesser Nature Communications 7, Article number: 10986

In this post I look at a subset of the language of the Japanese great tit, also known as Parus major. The above cited article presents evidence that bird brains can parse this language. What about standard modern computer parsing methods? Here is the subset -- probably a tiny one -- of the language actually used by Parus major.

      S ::= ABC
      S ::= D
      S ::= ABC D
      S ::= D ABC
    

Classifying the Parus major grammar

Grammophone is a very handy new tool for classifying grammars. Its own parser is somewhat limited, so that it requires a period to mark the end of a rule. The above grammar is in Marpa's SLIF format, which is smart enough to use the "::=" operator to spot the beginning and end of rules, just as the human eye does. Here's the same grammar converted into a form acceptable to Grammophone:

      S -> ABC .
      S -> D .
      S -> ABC D .
      S -> D ABC .
    

Grammophone tells us that the Parus major grammar is not LL(1), but that it is LALR(1).

What does this mean?

LL(1) is the class of grammar parseable by top-down methods: it's the best class for characterizing most parsers in current use, including recursive descent, PEG, and Perl 6 grammars. All of these parsers fall short of dealing with the Parus major language.

LALR(1) is probably most well-known from its implementations in bison and yacc. While able to handle this subset of Parus's language, LALR(1) has its own, very strict, limits. Whether LALR(1) could handle the full complexity of Parus language is a serious question. But it's a question that in practice would probably not arise. LALR(1) has horrible error handling properties.

When the input is correct and within its limits, an LALR-driven parser is fast and works well. But if the input is not perfectly correct, LALR parsers produce no useful analysis of what went wrong. If Parus hears "d abc d", a parser like Marpa, on the other hand, can produce something like this:

# * String before error: abc d\s
# * The error was at line 1, column 7, and at character 0x0064 'd', ...
# * here: d
    

Parus uses its language in predatory contexts, and one can assume that a Parus with a preference for parsers whose error handling is on an LALR(1) level will not be keeping its alleles in the gene pool for very long.

References, comments, etc.

Those readers content with sub-Parus parsing methods may stop reading here. Those with greater parsing ambitions, however, may wish to learn more about Marpa. A Marpa test script for parsing the Parus subset is in a Github gist. Marpa has a semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Perlgeek.de : Automating Deployments: 3+ Environments

Software is written to run in a production environment. This is where the goal of the business is achieved: making money for the business, or reaching and educating people, or whatever the reason for writing the software is. For websites, this is typically the set of Internet-facing public servers.

But the production environment is not where you want to develop software. Developing is an iterative process, and comes with its own share of mistakes and corrections. You don't want your customers to see all those mistakes as you make them, so you develop in a different environment, maybe on your PC or laptop instead of a server, with a different database (though hopefully using the same database software as in the production environment), possibly using a different authentication mechanism, and far less data than the production environment has.

You'll likely want to prevent certain interactions in the development environment that are desirable in production: Sending notifications (email, SMS, voice, you name it), charging credit cards, provisioning virtual machines, opening rack doors in your data center and so on. How that is done very much depends on the interaction. You can configure a mail transfer agent to deliver all mails to a local file or mail box. Some APIs have dedicated testing modes or installations; in the worst case, you might have to write a mock implementation that answers similarly to the original API, but doesn't carry out the action that the original API does.

Deploying software straight to production if it has only been tested on the developer's machine is a rather bad practice. Often the environments are too different, and the developer unknowingly relied on a feature of his environment that isn't the same in the production environment. Thus it is quite common to have one or more environments in between where the software is deployed and tested, and only propagated to the next deployment environment when all the tests in the previous one were successful.

After the software is modified in the development environment, it is deployed to the testing environment (with its own database), and if all tests were successful, propagated to the production environment.

One of these stages is often called testing. This is where the software is shown to the stakeholders to gather feedback, and if manual QA steps are required, they are often carried out in this environment (unless there is a separate environment for that).

A reason to have another non-production environment is to test service dependencies. If several different software components are deployed to the testing environment, and you decide to deploy only one or two at a time to production, things might break in production. The component you deployed might have a dependency on a newer version of another component, and since the testing environment contained that newer version, nobody noticed. Or maybe a database upgrade in the testing environment failed, and had to be repaired manually; you don't want the same to happen in a production setting, so you decide to test in another environment first.

After the software is modified in the development environment, it is deployed to the testing environment (with its own database), and if all tests were successful, propagated to the staging environment. Only if this works is the deployment to production carried out.

Thus many companies have another staging environment that mirrors the production environment as closely as possible. A planned production deployment is first carried out in the staging environment, and on success done in production too, or rolled back on error.

There are valid reasons to have even more environments. If automated performance testing is performed, it should be done in a separate environment where no manual usage is possible, to avoid distorting results. Other tests such as automated acceptance or penetration testing are best done in their own environments.

One can add more environments, for example for automated acceptance, penetration and performance testing; those typically come before the staging environment.

In addition, dedicated environments for testing and evaluating exploratory features are possible.

It should be noted that while these environments all serve valid purposes, they also come at a cost. Machines, either virtual or physical, on which all those environments run must be available, and they consume resources. They must be set up initially and maintained. License costs must be considered (for example for proprietary databases). Also, the time for deploying code increases as the number of environments increases. With more environments, automating deployments, and maybe even management and configuration of the infrastructure, becomes mandatory.



Dave's Free Press: Journal: Thankyou, Anonymous Benefactor!

Dave's Free Press: Journal: Number::Phone release

Perlgeek.de : Continuous Delivery for Libraries?

Last Thursday I gave a talk on Continuous Delivery (slides) at the German Perl Workshop 2016 (video recordings have been made, but aren't available yet). One of the questions from the audience was something along the lines of: would I use Continuous Delivery for a software library?

My take on this is that you typically develop a library driven by the needs of one or more applications, not just for the sake of developing a library. So you have some kind of pilot application which makes use of the new library features.

You can integrate the library into the application's build pipeline. Automatically build and unit test the library, and once it succeeds, upload the library into a repository. The build pipeline for the application can then download the newest version of the library, and include it in its build result (fat-packaging). The application build step now has two triggers: commits from its own version control repository, and library uploads.
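
What such a fetch-and-bundle step in the application's build job could look like, as a rough sketch; the artifact URL and the vendor/ directory are placeholders, and a GoCD fetchartifact task would be a natural alternative if the library is built on the same GoCD server:

#!/bin/bash
set -e
# download the most recently uploaded library build (placeholder URL)
curl -fsSL -o mylib.tar.gz "https://artifacts.example.com/mylib/latest/mylib.tar.gz"
# unpack it into the application tree so it is fat-packaged with the application
mkdir -p vendor
tar -xzf mylib.tar.gz -C vendor/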

Then the rest of the delivery pipeline for the application serves as quality gating for the library as well. If the pipeline includes integration tests and functional tests for the whole software stack, it will catch errors of the library, and deploy the library along with the application.



Dave's Free Press: Journal: Ill

Dave's Free Press: Journal: CPANdeps upgrade

Ocean of Awareness: Introduction to Marpa Book in progress

What follows is a summary of the features of the Marpa algorithm, followed by a discussion of potential applications. It refers to itself as a "monograph", because it is a draft of part of the introduction to a technical monograph on the Marpa algorithm. I hope the entire monograph will appear in a few weeks.

The Marpa project

The Marpa project was intended to create a practical and highly available tool to generate and use general context-free parsers. Tools of this kind had long existed for LALR and regular expressions. But, despite an encouraging academic literature, no such tool had existed for context-free parsing. The first stable version of Marpa was uploaded to a public archive on Solstice Day 2011. This monograph describes the algorithm used in the most recent version of Marpa, Marpa::R2. It is a simplification of the algorithm presented in my earlier paper.

A proven algorithm

While the presentation in this monograph is theoretical, the approach is practical. The Marpa::R2 implementation has been widely available for some time, and has seen considerable use, including in production environments. Many of the ideas in the parsing literature satisfy theoretical criteria, but in practice turn out to face significant obstacles. An algorithm may be as fast as reported, but may turn out not to allow adequate error reporting. Or a modification may speed up the recognizer, but require additional processing at evaluation time, leaving no advantage to compensate for the additional complexity.

In this monograph, I describe the Marpa algorithm as it was implemented for Marpa::R2. In many cases, I believe there are better approaches than those I have described. But I treat these techniques, however solid their theory, as conjectures. Whenever I mention a technique that was not actually implemented in Marpa::R2, I will always explicitly state that that technique is not in Marpa as implemented.

Features

General context-free parsing

As implemented, Marpa parses all "proper" context-free grammars. The proper context-free grammars are those which are free of cycles, unproductive symbols, and inaccessible symbols. Worst case time bounds are never worse than those of Earley's algorithm, and therefore never worse than O(n**3).

Linear time for practical grammars

Currently, the grammars suitable for practical use are thought to be a subset of the deterministic context-free grammars. Using a technique discovered by Joop Leo, Marpa parses all of these in linear time. Leo's modification of Earley's algorithm is O(n) for LR-regular grammars. Leo's modification also parses many ambiguous grammars in linear time.

Left-eidetic

The original Earley algorithm kept full information about the parse --- including partial and fully recognized rule instances --- in its tables. At every parse location, before any symbols are scanned, Marpa's parse engine makes available its information about the state of the parse so far. This information is in useful form, and can be accessed efficiently.

Recoverable from read errors

When Marpa reads a token which it cannot accept, the error is fully recoverable. An application can try to read another token. The application can do this repeatedly as long as none of the tokens are accepted. Once the application provides a token that is accepted by the parser, parsing will continue as if the unsuccessful read attempts had never been made.

Ambiguous tokens

Marpa allows ambiguous tokens. These are often useful in natural language processing where, for example, the same word might be a verb or a noun. Use of ambiguous tokens can be combined with recovery from rejected tokens so that, for example, an application could react to the rejection of a token by reading two others.

Using the features

Error reporting

An obvious application of left-eideticism is error reporting. Marpa's abilities in this respect are ground-breaking. For example, users typically regard an ambiguity as an error in the grammar. Marpa, as currently implemented, can detect an ambiguity and report specifically where it occurred and what the alternatives were.

Event driven parsing

As implemented, Marpa::R2 allows the user to define "events". Events can be defined that trigger when a specified rule is complete, when a specified rule is predicted, when a specified symbol is nulled, when a user-specified lexeme has been scanned, or when a user-specified lexeme is about to be scanned. A mid-rule event can be defined by adding a nulling symbol at the desired point in the rule, and defining an event which triggers when the symbol is nulled.

Ruby slippers parsing

Left-eideticism, efficient error recovery, and the event mechanism can be combined to allow the application to change the input in response to feedback from the parser. In traditional parser practice, error detection is an act of desperation. In contrast, Marpa's error detection is so painless that it can be used as the foundation of new parsing techniques.

For example, if a token is rejected, the lexer is free to create a new token in the light of the parser's expectations. This approach can be seen as making the parser's "wishes" come true, and I have called it "Ruby Slippers Parsing".

One use of the Ruby Slippers technique is to parse with a clean but oversimplified grammar, programming the lexical analyzer to make up for the grammar's short-comings on the fly. As part of Marpa::R2, the author has implemented an HTML parser, based on a grammar that assumes that all start and end tags are present. Such an HTML grammar is too simple even to describe perfectly standard-conformant HTML, but the lexical analyzer is programmed to supply start and end tags as requested by the parser. The result is a simple and cleanly designed parser that parses very liberal HTML and accepts all input files, in the worst case treating them as highly defective HTML.

Ambiguity as a language design technique

In current practice, ambiguity is avoided in language design. This is very different from the practice in the languages humans choose when communicating with each other. Human languages exploit ambiguity in order to design highly flexible, powerfully expressive languages. For example, the language of this monograph, English, is notoriously ambiguous.

Ambiguity of course can present a problem. A sentence in an ambiguous language may have undesired meanings. But note that this is not a reason to ban potential ambiguity --- it is only a problem with actual ambiguity.

Syntax errors, for example, are undesired, but nobody tries to design languages to make syntax errors impossible. A language in which every input was well-formed and meaningful would be cumbersome and even dangerous: all typos in such a language would be meaningful, and the parser would never warn the user about errors, because there would be no such thing.

With Marpa, ambiguity can be dealt with in the same way that syntax errors are dealt with in current practice. The language can be designed to be ambiguous, but any actual ambiguity can be detected and reported at parse time. This exploits Marpa's ability to report exactly where and what the ambiguity is. Marpa::R2's own parser description language, the SLIF, uses ambiguity in this way.

Auto-generated languages

In 1973, Čulik and Cohen pointed out that the ability to efficiently parse LR-regular languages opens the way to auto-generated languages. In particular, Čulik and Cohen note that a parser which can parse any LR-regular language will be able to parse a language generated using syntax macros.

Second order languages

In the literature, the term "second order language" is usually used to describe languages with features which are useful for second-order programming. True second-order languages --- languages which are auto-generated from other languages --- have not been seen as practical, since there was no guarantee that the auto-generated language could be efficiently parsed.

With Marpa, this barrier is raised. As an example, Marpa::R2's own parser description language, the SLIF, allows "precedenced rules". Precedenced rules are specified in an extended BNF. The BNF extensions allow precedence and associativity to be specified for each RHS.

Marpa::R2's precedenced rules are implemented as a true second order language. The SLIF representation of the precedenced rule is parsed to create a BNF grammar which is equivalent, and which has the desired precedence. Essentially, the SLIF does a standard textbook transformation. The transformation starts with a set of rules, each of which has a precedence and an associativity specified. The result of the transformation is a set of rules in pure BNF. The SLIF's advantage is that it is powered by Marpa, and therefore the SLIF can be certain that the grammar that it auto-generates will parse in linear time.

Notationally, Marpa's precedenced rules are an improvement over similar features in LALR-based parser generators like yacc or bison. In the SLIF, there are two important differences. First, in the SLIF's precedenced rules, precedence is generalized, so that it does not depend on the operators: there is no need to identify operators, much less class them as binary, unary, etc. This more powerful and flexible precedence notation allows the definition of multiple ternary operators, and multiple operators with arity above three.

Second, and more important, a SLIF user is guaranteed to get exactly the language that the precedenced rule specifies. The user of the yacc equivalent must hope their syntax falls within the limits of LALR.

References, comments, etc.

Marpa has a semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: YAPC::Europe 2006 report: day 3

Header image by Tambako the Jaguar. Some rights reserved.