Ovid: Announcing Veure at The Perl Conference

I'm back from Romania and had a lovely time at YAPC::EU, er, The European Perl Conference, er, or this:

I unveiled that suggested logo at my opening keynote only to discover that many Perl devs had no idea what I was talking about. My sense of humor is shouting "get off my lawn."

However, I gave a lightning talk announcing Veure, including the game's name and blog! Veure is officially known as Tau Station; sign up for our newsletter to find out more, including keeping up with the alpha. Or just read our blog to see what's happening with it (but you really want to sign up for that newsletter).

Many thanks to Evozon for hosting a great conference!

Perl Foundation News: Perl 6 Performance and Reliability Grant Progress Report

Jonathan Worthington writes:

I have completed the initial 200 hours awarded under my Perl 6 performance and reliability engineering grant. This report summarizes what has been achieved in this time. I have also written a number of more detailed blog posts about my work.

Tooling

I implemented heap snapshots in MoarVM. This is a mechanism for taking recordings of what is in the heap after each garbage collection run. It can be used to understand the memory use of programs, but also to track down memory leaks. The snapshots are produced by passing the --profile=heap option when invoking Rakudo Perl 6. They can then be analyzed using a tool, which I implemented in Perl 6. It makes good use of both native arrays and parallel processing, and so also serves as a good example of a Perl 6 program processing a non-trivial volume of data. The data appears to be something of a goldmine, and following up on and addressing everything raised by it will probably keep us busy for a good while. It has already been used to track down memory leaks and fix them.
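
For example, taking snapshots of a (deliberately) memory-hungry script looks something like the sketch below; the script itself is invented for illustration, and only the --profile=heap option comes from the work described above.

# leak-demo.p6 - a tiny program whose only job is to retain memory, so the
# heap snapshots have something interesting to show.
my @kept;
for ^50 {
    @kept.push: [ rand xx 1_000 ];
}
say "retained { @kept.elems } blocks";

# Take heap snapshots after each GC run (writes a snapshot file):
#     perl6 --profile=heap leak-demo.p6
# The resulting file can then be loaded into the Perl 6 heap analyzer
# mentioned above to see which objects are keeping the memory alive.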

Performance

My performance work largely focused on lower-level improvements, rather than optimizing the Perl 6 built-ins (which are already receiving plenty of attention). Each improvement is annotated with the components affected. I:

  • Had the compiler code-generate accessor methods where possible, rather than them being closures added by a trait. This in turn made them inlinable. This made attribute accessors many times faster. (Rakudo)
  • Re-designed and re-implemented the MoarVM multiple dispatch cache, so it can handle named parameters. I made it more compact in memory and faster to search along the way. With changes to Rakudo to take advantage of it, multiple dispatches involving named arguments got much faster. This most notably impacted constructs like @a[$i]:exists, which got around 20 times faster. (MoarVM, Rakudo)
  • Made a number of improvements to how return is implemented. This made it a real control exception, as per the language design documents. More notably, however, the changes made return a couple of times faster when used, cut the cost of nearly all routine invocations whether they used return or not, made it possible for more routines to be inlined, and reduced memory consumption. Further optimizations in this area will be possible thanks to these changes. (MoarVM, NQP, Rakudo)
  • Optimized throwing of next/last/redo control exceptions, making them much cheaper in the common case. (Rakudo)
  • Implemented lazy decoding of the string heap, improving startup time. This cut base Rakudo memory use by 1.26MB and shaved 2.7 million CPU cycles off startup. (MoarVM)
  • Fixed a string decoding performance bug that made reading very long lines extremely slow. (MoarVM)
  • Knocked 80% off get_boxed_ref, which is a hot path in Int math. (MoarVM)
  • Eliminated generating various unrequired decont operations at code-gen time. (NQP)
  • Significantly overhauled MoarVM's call frame handling, eliminating reference counts, simplifying memory management, fixing excessive GC time in programs that store a huge number of closures, and preparing the way for a number of future improvements. (MoarVM)
  • Avoided various bits of NULLing on frame entry and initialization, especially in specialized (optimized) frames. (MoarVM)
  • Fixed bugs in a submitted patch that made serialization and compilation vastly faster for large compilation units (such as the Rakudo CORE.setting), so that it could be merged. (MoarVM)
  • Made various additional optimizations to invocation, which added up to shaving a couple of percent off an invocation-heavy benchmark. (MoarVM)

Memory leak fixes and other memory use improvements

Aided by the heap analyzer described earlier in this report, along with tools from the Valgrind suite, I tracked down and fixed a number of memory leaks, and also reduced memory use.

  • Fixed a memory leak that could affect multi dispatch + constraints + flattening, and likely other situations. (MoarVM)
  • Eliminated a near-unused static frame array in MVMCompUnit, saving some hundreds of kilobytes of Rakudo base memory. (MoarVM)
  • Fixed a memory leak affecting EVAL and everything using it (some cases of regex interpolation were also impacted, for example). There were actually multiple problems, identified through using the heap analyzer. (NQP, MoarVM)
  • Analyzed and tuned GC performance for full collections, improving memory behavior for various kinds of programs. (MoarVM)
  • Eliminated caching of call contexts, reducing the size of all call frames and making context serialization simpler and cheaper. (MoarVM)
  • Eliminated retention of barely-used bytecode maps produced during validation and only initialized frame instrumentation state if needed, adding up to 3.5MB of savings on Rakudo's base memory. (MoarVM)

Concurrency bug fixes

A number of concurrency bugs were tracked down and eliminated, improving the reliability of programs using Perl 6's parallel and concurrent programming features.

  • Fixed a number of data races around thread spawning and the first GC run of the thread. (MoarVM)
  • Fixed mis-handling of spurious condition variable wake-ups in Promise.result. (Rakudo)
  • Fixed hang reported in RT #128628 by adding missing GC block/unblock around semaphore wait. (MoarVM)
  • Fixed deadlock that could occasionally occur in the concurrent blocking queue used for task scheduling. (MoarVM)
  • Added missing GC rooting around concurrency control constructs when they mark themselves blocked/unblocked. (MoarVM)
  • Fixed a race condition in the Channel.Supply coercer. (Rakudo)
  • Fixed a circular waiting bug that led to occasional deadlocks in some uses of the supply and react syntax. (Rakudo)
  • Made .close of a listening socket tap await the actual shutdown of the socket, fixing a race that caused instability in the async socket tests. (MoarVM, Rakudo)
  • Gave start blocks a fresh $/ and $!. (Rakudo)
  • Tracked down the problem with S17-lowlevel/lock.t sometimes failing/crashing; corrected a bug in the test, resolving the problem. (Spectest)
  • Started re-working VMArray so that misusing it across threads cannot cause crashes. (MoarVM)

Other assorted fixes

I fixed a selection of other problems, mostly coming from the RT bug queue. I've grouped them by the component that was primarily fixed.

Rakudo

  • RT #127548 (crash involving uint64 attribute code-gen)
  • RT #127660 (didn't pay attention to submethod Bool)
  • RT #127629 (issues with conveying exceptions in Supply <-> Channel coercions)
  • Fixed Mu.Str to use objectid, not memory address, eliminating some test instabilities
  • RT #127540 (anon subs triggering a bogus redeclaration error)
  • RT #128270 (mis-compilation of charset with ignoremark led to crashes, e.g. if used in combination with :g)
  • RT #128581 (poor error reporting for my Array[Numerix] $x)
  • RT #127749 (Seqs should not be stuck into the constants table)
  • RT #127785 (parse error if trying to use where in a unit sub MAIN signature)
  • RT #127473 (compiler explodes on (;), (;;), [0;] and similar)
  • RT #127394 (cannot write -> SomeSubtype:D $x { } as it produced a compiler error; now it works)
  • RT #128552 (missing $?MODULE and ::?MODULE symbols)
  • Fixed a hang in spectests on Windows, which ended up involving file lock mis-management in precompilation handling

MoarVM

  • RT #127530 (SEGV when concatenating certain characters)
  • RT #127272 (a JIT compilation bug in the string ge/le operators)
  • RT #123602 and RT #127782 (repeat + concat + substr interaction bug)
  • RT #126756 (SEGV on single utf8-c8 synthetic)
  • RT #127748 (SEGV due to a GC invariant violation, which led to memory corruption)
  • Fixed various missing GC rootings and write barriers uncovered in stress testing

Other assorted tasks

Some time was spent on the following tasks:

  • Reviewed various pull requests to Rakudo/NQP/MoarVM, providing feedback and/or merging them as appropriate.
  • Got OSX Travis support for MoarVM set up, after a regression on OSX got missed.
  • The odd bit of bug queue wrangling (merging duplicates, closing already fixed issues, rejecting things that are not bugs, etc.)

Sawyer X: Perl 5 Porters Mailing List Summary: August 22nd-28th

Hey everyone,

Following is the p5p (Perl 5 Porters) mailing list summary for the past week. Enjoy!

August 22nd-28th

News and highlights

Thread::Semaphore was upgraded in core to 2.13.

Grant reports

Issues

New issues

  • Perl #129048: lib/perlbug.t: avoid spurious failure when testing long PATH line.
  • Perl #129059: lexical subs - my sub using our sub segfaults.
  • Perl #129061: Valgrind: Buffer overrun in S_regmatch with pathological regular expression.
  • Perl #129068: SV *Perl_cv_const_sv_or_av: Assertion fail.
  • Perl #129069: Fuzzer-detected use-after-free in Perl_yylex.
  • Perl #129070: Refactor toke.c into smaller, more maintainable parts.
  • Perl #129071: Perl git repository not available over HTTPS.
  • Perl #129072: Typo in perlpodspec.
  • Perl #129073: Perl_yyparse: Assertion fail.
  • Perl #129087: Null ptr deref, segfault Perl_sv_setsv_flags.
  • Perl #129090: Perl_pad_fixup_inner_anons Null reference Memory corruption.
  • Perl #129098: Perl should have a cycle detector.
  • Perl #129106: null ptr deref, segfault Perl_sv_vcatpvfn_flags (sv.c:12398).

Reopened issues

This week a ticket was reopened because it was not yet resolved in blead.

  • Perl #129067: Use of inherited AUTOLOAD for non-method is deprecated.

Resolved issues

Rejected issues

  • Perl #129105: null ptr deref, segfault Perl_newSVpv (sv.c:9218).

Suggested patches

Dan Collins provided a patch in Perl #129058 ([PATCH] Perl_do_vop: resulting string isn't always null-terminated), relating to null-terminated strings.

Theo Buehler provided a patch in Perl #129102 (to update the man page links for strlcat and strlcpy).

Discussion

Aristotle Pagaltzis provided an improvement on the original patch to base.pm by Peter Rabbitson (Ribasushi). Another suggestion, raised by Todd Rinaldo, is to leave the module unchanged; this is supported by Peter, Aristotle, and Kent Fredric.

Martin Dyers raised a question (regarding Storable.pm) about the licensing of Data::Dumper, which does not contain its own license file: should we add a license file to each, or should we just document the license in the POD?

Father Chrysostomos provided comments regarding the utf8 warnings in Encode.

Herbert Breunung asked for suggestions on documentation (Pod) files that could use help with cleaning up, fixing, or general improvement.

Perl Foundation News: Maintaining Perl 5: Grant Report for July 2016

Tony Cook writes:

Approximately 27 tickets were reviewed, and 5 patches were applied.

Hours   Activity
 5.02   #126203 review code for leak issue, apply original patch,
        find related issues, research
        #126203 more related issues
        #126203 email to jhi
 0.95   #127663 re-familiarize, consider options
21.05   #127834 (sec) comments, fix some issues
        #127834 (sec) customized updates, testing, comment with
        new patchsets
        #127834 (sec) update patch sets, proposed perldelta
        #127834 (sec) review updates, research, comment
        #127834 (sec) fix some issues, consider some options,
        updates and comment
        #127834 (sec) perldelta updates, comment
        #127834 (sec) port forward to blead
        #127834 (sec) finish port forward
        #127834 (sec) upstream reports
        #127834 (sec) more upstream
        #127834 (sec) more upstream
        #127834 (sec) finish upstream
        #127834 (sec) fix PathTools version bug (blead)
 0.40   #128245 review, produce alternate patch and comment
 0.60   #128432 review, testing, apply to blead
 5.62   #128438, #128564 irc discussion, alternate patch, testing
        #128438 testing, comment
        #128438 testing, review
        #128438 more testing, apply a fix
 1.37   #128445 research, testing and comment
 0.52   #128517 review change and consider alternate changes,
        check smoke results and apply to blead
 2.58   #128524 review discussion, produce patch and comment
        #128524 adjust test, testing, apply to blead
 0.44   #128574 comment
        #128574 review, testing, push to smoke-me
 0.33   #128588 review discussion
 0.25   #128607 review discussion
 0.15   #128620 research, comment and close
 4.67   #128627 try to build with quadmath, Configure debugging
        #128627 work out Configure, try to trace library inclusion
        #128627 debugging, testing
 0.83   #128630 testing, review patches, comments
 0.10   #128673 research and comment
 1.85   #128685 try to work up a patch and comment
 0.40   #24000 research and comment
 0.52   #67424 comment
 0.97   look into darwin test failures
 0.22   look into khw locale configure probe issue
 1.70   more darwin test failures, debugging underlying cause,
        simple fix
 6.02   more parallel gmake
        more parallel gmake
        more parallel gmake
 1.97   more parallel gmake, fix search order conflict, repeat testing
 1.75   more parallel gmake, more optimization, re-work deps
        closer to previous
 2.45   more parallel gmake, polish, performance testing,
        optimization
 1.77   more parallel gmake, post as ticket 128564

64.50 Hours Total

Perl Hacks: DamianWare

Yesterday at YAPC Europe I gave a talk called “Error(s) Free Programming”. The slides are below, but it might make more sense once the video is online.

The talk is about Damian Conway’s module Lingua::EN::Inflexion and how it makes programmers’ lives easier. As part of the talk, I invented a logo for the fictional DamianWare brand. DamianWare is, of course, a brand that specialises in using deep Perl magic in order to produce tools that help Perl programmers be lazier.

It was just a joke. A throwaway visual to make a point in the presentation. But after the talk Mallory approached me and suggested that the logo would look great on a t-shirt that could be sold to benefit The Perl Foundation. I couldn't really argue with that.

And, having emailed him overnight, it turns out that Damian agrees it’s a good idea too.

So the shirts (and a couple of other things) are now available on Spreadshirt (currently the UK version; I'll try to make them more widely available as soon as possible).

There’s an easier to remember URL at http://perlhacks.com/damian.

Any profit that I make (and I think it’s about 20% of the sale price) will be donated to TPF as soon as I receive it.

The post DamianWare appeared first on Perl Hacks.

:: Luca Ferrari ::: Perl Magazine: Cultured Perl

Some news from perlsphere.net: the Cultured Perl community collaborative blog has been launched!
The idea is really interesting: have a nice online magazine about Perl.
I'm not a Perl expert, at least not enough to be a writer/author for such a publication, but I will surely read it.

Perl Foundation News: Maintaining the Perl 5 Core: Report for Month 34

Dave Mitchell writes:

I spent last month mainly working on "fuzzer" bug reports, and trying to process some of the backlog in my p5p mailbox.

Summary

1:45 "Confused by eval behavior" thread
1:21 [perl #127834] @INC issues
1:26 [perl #128241] Deprecate /$empty_string/
2:03 [perl #128253] Assert fail in S_find_uninit_var
1:19 [perl #128255] Assert fail in S_sublex_done
0:26 [perl #128257] Segfault in Perl_gv_setref
0:14 [perl #128258] Segfault due to stack overflow
3:16 fix build warnings and smoke failures
8:46 process p5p mailbox

20:36 Total (HH::MM)

As of 2016/07/31, since the beginning of the grant:

146.0 weeks
1988.7 total hours
13.6 average hours per week

There are 411 hours left on the grant (it having just been extended by 400 hours).

Perl Hacks: Cultured Perl

Back in about 2008, I set up a group blog called “Cultured Perl”. The idea was to have a blog that concentrated on the Perl community rather than the technical aspects that most Perl bloggers write about most of the time. It didn’t last very long though and after a few posts it quietly died. But the name “Cultured Perl” still appeals to my love of bad puns and I knew I would reuse it at some point.

At YAPC Europe 2010 in Pisa, I gave a lightning talk called Perl Vogue. It talked about the way that Perl modules come into fashion and often go out of fashion again very quickly. I suggested an online Perl magazine which would tell people which modules were fashionable each month. It was a joke, of course (not least because Vogue are famously defensive of their brand).

Over the last several years, people have suggested that the Perl community needs to get "out of the echo chamber" and talk to people who aren't part of the community. For example, instead of posting and answering Perl questions on a Perl-specific web site like Perl Monks, it's better to do it on a general programming site like Stack Overflow.

Hold those three thoughts. “Cultured Perl”, online Perl magazine, getting out of the echo chamber.

Medium is a very popular blogging site. Many people have moved their blogging there and it’s a great community for writing, sharing and recommending long-form writing. I get a “recommended reading” email from Medium every day and it always contains links to several interesting articles.

Medium has two other features that interest me. Firstly, you can tag posts. So if you write a post about web development using Perl and tag it with “web dev” then it will be seen by anyone who is following the web dev tag. That’s breaking out of the echo chamber.

Secondly, Medium has “publications”. That is, you can bring a set of articles together under your own banner. Publication owners can style their front page in various ways to differentiate it from Medium’s default styling. Readers can subscribe to publications and they will then be notified of every article published in that publication. That’s an online magazine.

So I’ve set up a publication on Medium (called “Cultured Perl” – to complete the set of three ideas). My plan is to publish (or republish) top quality Perl articles so we slowly build a brand outside of the echo chamber where people know they can find all that is best in Perl writing.

If you write about Perl, please consider signing up to Medium, becoming a contributor to Cultured Perl and submitting your articles for publication. I’ll publish the best ones (and, hopefully, work with authors to improve the others so they are good enough to publish).

I’m happy to republish stuff from your other blogs. I’m not suggesting that we suddenly move all Perl blogging to Medium. For example, whenever I publish something on Perl Hacks, the post gets mirrored to a Perl Hacks publication that I set up on Medium earlier this year. There’s a WordPress to Medium plugin that does that automatically for me. There may well be similar tools for other blogging platforms (if you can’t find one for your blog – then Medium has an API so you could  write one).
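
For the curious, here is a rough Perl sketch of what such a tool might look like. The api.medium.com endpoints and payload fields below are assumptions about Medium's v1 API rather than verified details, so check the current API documentation before relying on any of it.

#!/usr/bin/env perl
# Hypothetical sketch: push an article to Medium over HTTP. The /v1/me and
# /v1/users/:id/posts endpoints and the payload fields are assumptions;
# verify them against Medium's current API documentation first.
use strict;
use warnings;
use HTTP::Tiny;
use JSON::PP qw(encode_json decode_json);

my $token = $ENV{MEDIUM_TOKEN} or die "Set MEDIUM_TOKEN to an integration token first\n";
my $http  = HTTP::Tiny->new;
my %headers = (
    'Authorization' => "Bearer $token",
    'Content-Type'  => 'application/json',
);

# Look up our own user id (assumed endpoint).
my $me = $http->get('https://api.medium.com/v1/me', { headers => \%headers });
die "user lookup failed: $me->{status}\n" unless $me->{success};
my $user_id = decode_json($me->{content})->{data}{id};

# Create a draft post, tagged "perl" so it reaches followers of that tag.
my $res = $http->post(
    "https://api.medium.com/v1/users/$user_id/posts",
    {
        headers => \%headers,
        content => encode_json({
            title         => 'My article for Cultured Perl',
            contentFormat => 'markdown',
            content       => "# Hello\n\nWritten in Perl, posted by Perl.",
            tags          => ['perl'],
            publishStatus => 'draft',
        }),
    },
);
die "post failed: $res->{status}\n" unless $res->{success};
print 'Created draft: ', decode_json($res->{content})->{data}{url}, "\n";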

If you are a reader, then please consider subscribing to Cultured Perl. And please recommend (by clicking on the heart symbol) any articles that you enjoy. The more recommendations that an article gets, the more likely it becomes that Medium will recommend it to other readers.

I have no idea how this will go, but over the next few months I hope to start by publishing four or five articles every week. Perhaps you could start by submitting articles about what a great time you had at YAPC Europe.

Oh, and here are the slides from the lightning talk I used to announce this project at YAPC Europe in Cluj-Napoca, Romania yesterday.

 

The post Cultured Perl appeared first on Perl Hacks.

Dave's Free Press: Journal: Module pre-requisites analyser

Dave's Free Press: Journal: CPANdeps

Dave's Free Press: Journal: Perl isn't dieing

Perlgeek.de : Continuous Delivery and Security

What's the impact of automated deployment on the security of your applications and infrastructure?

It turns out there are both security advantages, and things to be wary of.

The Dangers of Centralization

In a deployment pipeline, the machine that controls the deployment needs to have access to the target machines where the software is deployed.

In the simplest case, there is a private SSH key on the deployment machine, and the target machines grant access to the owner of that key.

This is an obvious risk, since an attacker gaining access to the deployment machine (or in the examples discussed previously, the GoCD server controlling the machine) can use this key to connect to all of the target machines.

Some possible mitigations include:

  • hardened setup of the deployment machine
  • password-protect the SSH key and supply the password through the same channel that triggers the deployment
  • have separate deployment and build hosts. Build hosts tend to need far more software installed, which implies a bigger attack surface
  • on the target machines, only allow unprivileged access through said SSH key, and use something like sudo to allow only certain privileged operations

Each of these mitigations has its own costs and weaknesses. For example, password-protecting SSH keys helps if the attacker only manages to obtain a copy of the file system, but not if the attacker gains root privileges on the machine and can thus obtain a memory dump that includes the decrypted SSH key.

The sudo approach is very effective at limiting the spread of an attack, but it requires extensive configuration on the target machine, and you need a secure way to deploy that configuration. So you run into a chicken-and-egg problem, and it adds quite a bit of extra effort.

On the flip side, if you don't have a delivery pipeline, deployments have to happen manually, so you have the same problem of needing to give humans access to the target machines. Most organizations offer some kind of secured host on which the operators' SSH keys are stored, and you face the same risks with that host as with the deployment host.

Time to Market for Security Fixes

Compared to manual deployments, even a relatively slow deployment pipeline is still quite fast. When a vulnerability is identified, this quick and automated rollout process can make a big difference in reducing the time until the fix is deployed.

Equally important is the fact that a clunky manual release process seduces the operators into taking shortcuts around security fixes, skipping some steps of the quality assurance process. When that process is automated and fast, it is easier to adhere to the process than to skip it, so it will actually be carried out even in stressful situations.

Audits and Software Bill of Materials

A good deployment pipeline tracks when which version of a software was built and deployed. This allows one to answer questions such as "For how long did we have this security hole?", "How soon after the report was the vulnerability patched in production?" and maybe even "Who approved the change that introduced the vulnerability?".

If you also use configuration management based on files that are stored in a version control system, you can answer these questions even for configuration, not just for software versions.

In short, the deployment pipeline provides enough data for an audit.

Some legislation requires you to record a Software Bill of Materials. This is a record of which components are contained in some software, for example a list of libraries and their versions. While this is important for assessing the impact of a license violation, it is also important for figuring out which applications are affected by a vulnerability in a particular version of a library.

For example, a 2015 report by HP Security found that 44% of the investigated breaches were made possible by vulnerabilities that had been known (and presumably patched) for at least two years. That in turn means that you can nearly halve your security risk by tracking which software version you use where, subscribing to a newsletter or feed of known vulnerabilities, and rebuilding and redeploying your software with patched versions.

A Continuous Delivery system doesn't automatically create such a Software Bill of Materials for you, but it gives you a place where you can plug in a system that does this for you.
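
As a rough sketch of the kind of thing you could plug in, here is a small Perl script that records the installed Debian packages and their versions as a JSON document, which a pipeline stage could archive next to the build artifacts. The dpkg-query invocation and the sbom.json file name are illustrative choices, not part of GoCD or of any SBOM standard.

#!/usr/bin/env perl
# Record a crude Software Bill of Materials for a Debian-based machine:
# every installed package and its version, written out as JSON.
use strict;
use warnings;
use JSON::PP qw(encode_json);
use POSIX qw(strftime);

my %packages;
open my $dpkg, '-|', 'dpkg-query', '-W', '-f', '${Package} ${Version}\n'
    or die "Cannot run dpkg-query: $!\n";
while (my $line = <$dpkg>) {
    chomp $line;
    my ($name, $version) = split / /, $line, 2;
    $packages{$name} = $version;
}
close $dpkg or die "dpkg-query reported a failure\n";

my $sbom = {
    host      => (POSIX::uname())[1],
    generated => strftime('%Y-%m-%dT%H:%M:%SZ', gmtime),
    packages  => \%packages,
};

open my $out, '>', 'sbom.json' or die "Cannot write sbom.json: $!\n";
print {$out} encode_json($sbom);
close $out;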

Conclusions

Continuous Delivery gives the ability to react quickly and predictably to newly discovered vulnerabilities. At the same time, the deployment pipeline itself is an attack surface, which, if not properly secured, can be quite an attractive target for an intruder.

Finally, the deployment pipeline can help you to collect data that can give insight into the usage of software with known vulnerabilities, allowing you to be thorough when patching these security holes.


I'm writing a book on automating deployments. If this topic interests you, please sign up for the Automating Deployments newsletter. It will keep you informed about automating and continuous deployments. It also helps me to gauge interest in this project, and your feedback can shape the course it takes.


Dave's Free Press: Journal: YAPC::Europe 2007 report: day 3

Perlgeek.de : Managing State in a Continuous Delivery Pipeline

Continuous Delivery is all nice and fluffy for a stateless application. Installing a new version is a simple task: install the new binaries (or sources, in the case of a language that's not compiled), stop the old instance, and start a new instance. Bonus points for reversing the order of the last two steps, to avoid downtime.

But as soon as there is persistent state to consider, things become more complicated.

Here I will consider traditional, relational databases with schemas. You can avoid some of the problems by using a schemaless "noSQL" database, but you don't always have that luxury, and it doesn't solve all of the problems anyway.

Along with the schema changes you have to consider data migrations, but they aren't generally harder to manage than schema changes, so I'm not going to consider them in detail.

Synchronization Between Code and Database Versions

State management is hard because code is usually tied to a version of the database schema. There are several cases where this can cause problems:

  • Database changes are often slower than application updates. If version 1 of your application can only deal with version 1 of the schema, and version 2 of the application can only deal with version 2 of the schema, you have to stop the application in version 1, do the database upgrade, and start up the application only after the database migration has finished.
  • Stepbacks become painful. Typically either a database change or its rollback can lose data, so you cannot easily do an automated release and stepback over these boundaries.

To elaborate on the last point, consider the case where a column is added to a table in the database. In this case the rollback of the change (deleting the column again) loses data. On the other hand, if the original change is to delete a column, that step usually cannot be reversed; you can recreate a column of the same type, but the data is lost. Even if you archive the deleted column's data, new rows might have been added to the table in the meantime, and there is no archived data to restore for those rows.

Do It In Multiple Steps

There is no tooling that can solve these problems for you. The only practical approach is to collaborate with the application developers, and break up the changes into multiple steps (where necessary).

Suppose your desired change is to drop a column that has a NOT NULL constraint. Simply dropping the column in one step comes with the problems outlined above. In a simple scenario, you might be able to do the following steps instead:

  • Deploy a database change that makes the column nullable (or give it a default value)
  • Wait until you're sure you don't want to roll back to a version where this column is NOT NULL
  • Deploy a new version of the application that doesn't use the column anymore
  • Wait until you're sure you don't want to roll back to a version of your application that uses this column
  • Deploy a database change that drops the column entirely.

In a more complicated scenario, you might first need to deploy a version of your application that can deal with reading NULL values from this column, even if no code writes NULL values yet.
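
To make the drop-column scenario a bit more concrete, here is a hedged sketch of the individual schema steps as SQL, wrapped in a tiny Perl/DBI runner. The table and column names (orders.legacy_flag) and the PostgreSQL connection string are invented for illustration; in practice each step would live in its own migration and be deployed separately, with the waits described above in between.

#!/usr/bin/env perl
# Each "step" below would normally be its own migration, deployed in its own
# pipeline run. Running them back to back defeats the purpose; this only
# shows the shape of the individual changes.
use strict;
use warnings;
use DBI;

my %step = (
    # Step 1: stop requiring the column, so old and new application
    # versions can coexist with the schema.
    relax_constraint => 'ALTER TABLE orders ALTER COLUMN legacy_flag DROP NOT NULL',
    # Step 2 happens in the application, not the database: deploy a release
    # that no longer reads or writes orders.legacy_flag.
    # Step 3: once a rollback to a release that uses the column is off the
    # table, drop it for good (this is the step that loses data).
    drop_column      => 'ALTER TABLE orders DROP COLUMN legacy_flag',
);

my $name = shift @ARGV or die "usage: $0 relax_constraint|drop_column\n";
my $sql  = $step{$name} or die "unknown step '$name'\n";

my $dbh = DBI->connect('dbi:Pg:dbname=exampledb', '', '',
    { RaiseError => 1, AutoCommit => 1 });
print "running: $sql\n";
$dbh->do($sql);
$dbh->disconnect;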

Adding a column to a table works in a similar way:

  • Deploy a database change that adds the new column with a default value (or NULLable)
  • Deploy a version of the application that writes to the new column
  • optionally run some migrations that fill the column for old rows
  • optionally deploy a database change that adds constraints (like NOT NULL) that weren't possible at the start

... with the appropriate waits between the steps.

Prerequisites

If you deploy a single logical database change in several steps, you need to do maybe three or four separate deployments, instead of one big deployment that introduces both code and schema changes at once. That's only practical if the deployments are (at least mostly) automated, and if the organization offers enough continuity that you can actually finish the change process.

If the developers are constantly putting out fires, chances are they never get around to adding that final, desired NOT NULL constraint, and some undiscovered bug will lead to missing information later down the road.

Tooling

Unfortunately, I know of no tooling that supports the inter-related database and application release cycle that I outlined above.

But there are tools that manage schema changes in general. For example sqitch is a rather general framework for managing database changes and rollbacks.

On the lower level, there are tools like pgdiff that compare the old and new schema, and use that to generate DDL statements that bring you from one version to the next. Such automatically generated DDLs can form the basis of the upgrade scripts that sqitch then manages.

Some ORMs also come with frameworks that promise to manage schema migrations for you. Carefully evaluate whether they allow rollbacks without losing data.

No Silver Bullet

There is no single solution that manages all your data migrations automatically for you during your deployments. You have to carefully engineer the application and database changes to decouple them a bit. This is typically more work on the application development side, but it buys you the ability to deploy and roll back without being blocked by database changes.

Tooling is available for some pieces, but typically not for the big picture. Somebody has to keep track of the application and schema versions, or automate that.


I'm writing a book on automating deployments. If this topic interests you, please sign up for the Automating Deployments newsletter. It will keep you informed about automating and continuous deployments. It also helps me to gauge interest in this project, and your feedback can shape the course it takes.


Dave's Free Press: Journal: Devel::CheckLib can now check libraries' contents

Ocean of Awareness: Top-down parsing is guessing

Top-down parsing is guessing. Literally. Bottom-up parsing is looking.

The way you'll often hear that phrased is that top-down parsing is looking, starting at the top, and bottom-up parsing is looking, starting at the bottom. But that is misleading, because the input is at the bottom -- at the top there is nothing to look at. A usable top-down parser must have a bottom-up component, even if that component is just lookahead.

A more generous, but still accurate, way to describe the top-down component of parsers is "prediction". And prediction is, indeed, a very useful component of a parser, when used in combination with other techniques.

Of course, if a parser does nothing but predict, it can predict only one input. Top-down parsing must always be combined with a bottom-up component. This bottom-up component may be as modest as lookahead, but it must be there or else top-down parsing is really not parsing at all.

So why is top-down parsing used so much?

Top-down parsing may be unusable in its pure form, but from one point of view that is irrelevant. Top-down parsing's biggest advantage is that it is highly flexible -- there's no reason to stick to its "pure" form.

A top-down parser can be written as a series of subroutine calls -- a technique called recursive descent. Recursive descent allows you to hook in custom-written bottom-up logic at every top-down choice point, and it is a technique which is completely understandable to programmers with little or no training in parsing theory. When dealing with recursive descent parsers, it is more useful to be a seasoned, far-thinking programmer than it is to be a mathematician. This makes recursive descent very appealing to seasoned, far-thinking programmers, and they are the audience that counts.

Switching techniques

You can even use the flexibility of top-down to switch away from top-down parsing. For example, you could claim that a top-down parser could do anything my own parser (Marpa) could do, because a recursive descent parser can call a Marpa parser.

A less dramatic switchoff, and one that still leaves the parser with a good claim to be basically top-down, is very common. Arithmetic expressions are essential for a computer language. But they are also among the many things top-down parsing cannot handle, even with ordinary lookahead. Even so, most computer languages these days are parsed top-down -- by recursive descent. These recursive descent parsers deal with expressions by temporarily handing control over to a bottom-up operator precedence parser. Neither of these parsers is extremely smart about the hand-over and hand-back -- it is up to the programmer to make sure the two play together nicely. But used with caution, this approach works.
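
To make that hand-over concrete, here is a toy sketch in Perl, invented purely for illustration (it is not code from any of the parsers discussed here): a recursive descent routine parses simple name = expression; statements, and hands control to a precedence-climbing routine whenever it needs an expression.

#!/usr/bin/env perl
# Toy parser: recursive descent for statements, precedence climbing (a form
# of operator precedence parsing) for expressions.
use strict;
use warnings;
use Data::Dumper;

my @tokens;

sub tokenize {
    my ($src) = @_;
    @tokens = $src =~ /\s*([A-Za-z_]\w*|\d+|[-+*\/()=;])/g;
}

sub peek       { $tokens[0] }
sub next_token { shift @tokens }

sub expect {
    my ($want) = @_;
    my $got = next_token();
    die "expected '$want', got '" . ($got // 'EOF') . "'\n"
        unless defined $got && $got eq $want;
}

# Top-down part: statement -> IDENT '=' expression ';'
sub parse_statement {
    my $name = next_token();
    die "expected an identifier\n" unless defined $name && $name =~ /^[A-Za-z_]\w*$/;
    expect('=');
    my $value = parse_expression(0);   # hand over to the bottom-up part
    expect(';');
    return [ assign => $name, $value ];
}

# Bottom-up part: precedence climbing over the binary operators.
my %prec = ( '+' => 1, '-' => 1, '*' => 2, '/' => 2 );

sub parse_expression {
    my ($min_prec) = @_;
    my $left = parse_primary();
    while (defined peek() && exists $prec{ peek() } && $prec{ peek() } >= $min_prec) {
        my $op    = next_token();
        my $right = parse_expression($prec{$op} + 1);
        $left = [ $op => $left, $right ];
    }
    return $left;
}

sub parse_primary {
    my $tok = next_token();
    die "unexpected end of input\n" unless defined $tok;
    if ($tok eq '(') {
        my $expr = parse_expression(0);
        expect(')');
        return $expr;
    }
    return $tok;
}

tokenize('answer = 1 + 2 * (3 + 4);');
print Dumper(parse_statement());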

Top-down parsing and language-oriented programming

But what about taking top-down methods into the future of language-oriented programming, extensible languages, and grammars which write grammars? Here we are forced to confront the reality -- that the effectiveness of top-down parsing comes entirely from the foreign elements that are added to it. Starting from a basis of top-down parsing is literally starting with nothing. As I have shown in more detail elsewhere, top-down techniques simply do not have enough horsepower to deal with grammar-driven programming.

Perl 6 grammars are top-down -- PEG with lots of extensions. These extensions include backtracking, backtracking control, a new style of tie-breaking and lots of opportunity for the programmer to intervene and customize everything. But behind it all is a top-down parse engine.

One aspect of Perl 6 grammars might be seen as breaking out of the top-down trap. That trick of switching over to a bottom-up operator precedence parser for expressions, which I mentioned above, is built into Perl 6 and semi-automated. (I say semi-automated because making sure the two parsers "play nice" with each other is not automated -- that's still up to the programmer.)

As far as I know, this semi-automation of expression handling is new with Perl 6 grammars, and it may prove handy for duplicating what is done in recursive descent parsers. But it adds no new technique to those already in use. And features like

  • multiple types of expression, which can be told apart based on their context,
  • n-ary expressions for arbitrary n, and
  • the autogeneration of multiple rules, each allowing a different precedence scheme, for expressions of arbitrary arity and associativity,

all of which are available and in current use in Marpa, are impossible for the technology behind Perl 6 grammars.

I am a fan of the Perl 6 effort. Obviously, I have doubts about one specific set of hopes for Perl 6 grammars. But these hopes have not been central to the Perl 6 effort, and I will be an eager student of the Perl 6 team's work over the coming months.

Comments

To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Perlgeek.de : Moritz on Continuous Discussions (#c9d9)

On Tuesday I was a panelist on the Continuous Discussions Episode 47 – Open Source and DevOps. It was quite some fun!

Much of the discussion applied to Open Source in software development generally, not just in DevOps.

You can watch the full session on Youtube, or on the Electric Cloud blog.


I'm writing a book on automating deployments. If this topic interests you, please sign up for the Automating Deployments newsletter. It will keep you informed about automating and continuous deployments. It also helps me to gauge interest in this project, and your feedback can shape the course it takes.


Dave's Free Press: Journal: I Love Github

Dave's Free Press: Journal: Palm Treo call db module

Perlgeek.de : Automating Deployments: Building in the Pipeline

The first step of an automated deployment system is always the build. (For software that doesn't need a build in order to be tested, the test might come first, but stay with me nonetheless.)

At this point, I assume that there is already a build system in place that produces packages in the desired format, here .deb files. Here I will talk about integrating this build step into a pipeline that automatically polls a git repository for new versions, runs the build, and records the resulting .deb package as a build artifact.

A GoCD Build Pipeline

As mentioned earlier, my tool of choice for controlling the pipeline is Go Continuous Delivery. Once you have installed it and configured an agent, you can start to create a pipeline.

GoCD lets you build pipelines in its web interface, which is great for exploring the available options. But for a blog entry, it's easier to look at the resulting XML configuration, which you can also enter directly ("Admin" → "Config XML").

So without further ado, here's the first draft:


  <pipelines group="deployment">
    <pipeline name="package-info">
      <materials>
        <git url="https://github.com/moritz/package-info.git" dest="package-info" />
      </materials>
      <stage name="build" cleanWorkingDir="true">
        <jobs>
          <job name="build-deb" timeout="5">
            <tasks>
              <exec command="/bin/bash" workingdir="package-info">
                <arg>-c</arg>
                <arg>debuild -b -us -uc</arg>
              </exec>
            </tasks>
            <artifacts>
              <artifact src="package-info*_*" dest="package-info/" />
            </artifacts>
          </job>
        </jobs>
      </stage>
    </pipeline>
  </pipelines>

The outermost element is a pipeline group, which has a name. It can be used to make it easier to get an overview of available pipelines, and also to manage permissions. Not very interesting for now.

The second level is the <pipeline> with a name, and it contains a list of materials and one or more stages.

Materials

A material is anything that can trigger a pipeline, and/or provide files that commands in a pipeline can work with. Here the only material is a git repository, which GoCD happily polls for us. When it detects a new commit, it triggers the first stage in the pipeline.

Directory Layout

Each time a job within a stage is run, the go agent (think worker) which runs it prepares a directory in which it makes the materials available. On Linux, this directory defaults to /var/lib/go-agent/pipelines/$pipeline_name. Paths in the GoCD configuration are typically relative to this path.

For example, the material definition above contains the attribute dest="package-info", so the absolute path to this git repository is /var/lib/go-agent/pipelines/package-info/package-info. Leaving out the dest="..." works, and gives you one less level of directory, but it only works for a single material. It is a rather shaky assumption that you won't need a second material, so don't do that.

See the config references for a list of available material types and options. Plugins are available that add further material types.

Stages

All the stages in a pipeline run serially, and each one runs only if the previous stage succeeded. A stage has a name, which is used both in the front end and for fetching artifacts.

In the example above, I gave the stage the attribute cleanWorkingDir="true", which makes GoCD delete files created during the previous build, and discard changes to files under version control. This tends to be a good option to use, otherwise you might unknowingly slide into a situation where a previous build affects the current build, which can be really painful to debug.

Jobs, Tasks and Artifacts

Jobs are potentially executed in parallel within a stage, and have names for the same reasons that stages do.

Inside a job there can be one or more tasks. Tasks are executed serially within a job. I tend to mostly use <exec> tasks (and <fetchartifact>, which I will cover in a later blog post), which invoke system commands. They follow the UNIX convention of treating an exit status of zero as success, and everything else as a failure.

For more complex commands, I create shell or Perl scripts inside a git repository, and add that repository as a material to the pipeline, which makes them available during the build process with no extra effort.
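
Such a helper script can be as simple as the following sketch (the script and the check it performs are invented for this example); all GoCD cares about is that it exits with status zero on success and non-zero on failure.

#!/usr/bin/env perl
# check-changelog -- example pipeline helper script. GoCD treats an exit
# status of zero as success and anything else as failure, so we simply die()
# (which exits non-zero) whenever we are unhappy.
use strict;
use warnings;

my $changelog = shift @ARGV // 'debian/changelog';

open my $fh, '<', $changelog
    or die "Cannot open $changelog: $!\n";
my $first = <$fh>;
close $fh;

die "$changelog looks empty\n" unless defined $first;
die "First line of $changelog does not look like a changelog entry\n"
    unless $first =~ /^\S+ \([^)]+\) [^;]+; urgency=/;

print "changelog OK: $first";
exit 0;    # success, so the stage may continue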

The <exec> task in our example invokes /bin/bash -c 'debuild -b -us -uc'. Which is a case of Cargo Cult Programming, because invoking debuild directly works just as well. Ah well, will revise later.

debuild -b -us -uc builds the Debian package, and is executed inside the git checkout of the source. It produces a .deb file, a .changes file and possibly a few other files with meta data. They are created one level above the git checkout, so in the root directory of the pipeline run.

These are the files that we want to work with later on, so we let GoCD store them in an internal database. That's what the <artifact> element instructs GoCD to do.

Since the name of the generated files depends on the version number of the built Debian package (which comes from the debian/changelog file in the git repo), it's not easy to reference them by name later on. That's where the dest="package-info/" comes into play: it makes GoCD store the artifacts in a directory with a fixed name. Later stages can then retrieve all artifact files from this directory by the fixed name.

The Pipeline in Action

If nothing goes wrong (and nothing ever does, right?), this is roughly what the web interface looks like after running the new pipeline:

So, whenever there is a new commit in the git repository, GoCD happily builds a Debian package and stores it for further use. Automated builds, yay!

But there is a slight snag: It recycles version numbers, which other Debian tools are very unhappy about. In the next blog post, I'll discuss a way to deal with that.


I'm writing a book on automating deployments. If this topic interests you, please sign up for the Automating Deployments newsletter. It will keep you informed about automating and continuous deployments. It also helps me to gauge interest in this project, and your feedback can shape the course it takes.


Perlgeek.de : Continuous Delivery on your Laptop

An automated deployment system, or delivery pipeline, builds software, and moves it through the various environments, like development, testing, staging, and production.

But what about testing and developing the delivery system itself? In which environment do you develop new features for the pipeline?

Start Small

When you are starting out you can likely get away with having just one environment for the delivery pipeline: the production environment.

It might shock you that you're supposed to develop anything in the production environment, but you should also be aware that the delivery system is not crucial for running your production applications, "just" for updating them. If the pipeline is down, your services still work. And you structure the pipeline to do the same jobs in both the testing and the production environment, so you test the deployments in a test environment first.

A Testing Environment for the Delivery Pipeline?

If those arguments don't convince you, or you're at a point where developer productivity suffers immensely from an outage of the deployment system, you can consider creating a testing environment for the pipeline itself.

But pipelines in this testing environment should not be allowed to deploy to the actual production environment, and ideally shouldn't interfere with the application testing environment either. So you have to create at least a partial copy of your usual environments, just for testing the delivery pipeline.

This is only practical if you have automated basically all of the configuration and provisioning, and have access to some kind of cloud solution to provide you with the resources you need for this endeavour.

Creating a Playground

If you do decide that you do need some playground or testing environment for your delivery pipeline, there are a few options at your disposal. But before you build one, you should be aware of how many (or few) resources such an environment consumes.

Resource Usage of a Continuous Delivery Playground

For a minimal playground that builds a system similar to the one discussed in earlier blog posts, you need

  • a machine on which you run the GoCD server
  • a machine on which you run a GoCD agent
  • a machine that acts as the testing environment
  • a machine that acts as the production environment

You can run the GoCD server and agent on the same machine if you wish, which reduces the footprint to three machines.

The machine on which the GoCD server runs should have between one and two gigabytes of memory, and one or two (virtual) CPUs. The agent machine should have about half a GB of memory, and one CPU. If you run both server and agent on the same machine, two GB of RAM and two virtual CPUs should do nicely.

The specifications of the remaining two machines mostly depend on the type of applications you deploy and run on them. For the deployment itself you just need an SSH server running, which is very modest in terms of memory and CPU usage. If you stick to the example applications discussed in this blog series, or similarly lightweight applications, half a GB of RAM and a single CPU per machine should be sufficient. You might get away with less RAM.

So in summary, the minimal specs are:

  • One VM with 2 GB RAM and 2 CPUs, for go-server and go-agent
  • Two VMs with 0.5 GB RAM and 1 CPU each, for the "testing" and the "production" environments.

In the idle state, the GoCD server periodically polls the git repos, and the GoCD agent polls the server for work.

When you are not using the playground, you can shut off those processes, or even the whole machines.

Approaches to Virtualization

These days, almost nobody buys server hardware and runs such test machines directly on them. Instead there is usually a layer of virtualization involved, which both makes new operating system instances more readily available, and allows a denser resource utilization.

Private Cloud

If you work in a company that has its own private cloud, for example an OpenStack installation, you could use that to create a few virtual machines.

Public Cloud

Public cloud compute solutions, such as Amazon's EC2, Google's Compute Engine and Microsoft's Azure cloud offerings, allow you to create VM instances on demand, and be billed at an hourly rate. On all three services, you pay less than 0.10 USD per hour for an instance that can run the GoCD server[^pricedate].

[^pricedate]: Prices from July 2016, though I expect prices to only go downwards. Though resource usage of the software might increase in future as well.

Google Compute Engine even offers heavily discounted preemptible VMs. Those VMs are only available when the provider has excess resources, and come with the option to be shut down on relatively short notice (a few minutes). While this is generally not a good idea for an always-on production system, it can be a good fit for a cheap testing environment for a delivery pipeline.

Local Virtualization Solutions

If you have a somewhat decent workstation or laptop, you likely have sufficient resources to run some kind of virtualization software directly on it.

Instead of classical virtualization solutions, you could also use a containerization solution such as Docker, which provides enough isolation for testing a Continuous Delivery pipeline. The downside is that Docker is not meant for running several services in one container, and here you need at least an SSH server and the actual services that are being deployed. You could work around this by using Ansible's Docker connector instead of SSH, but then you make the testing playground quite dissimilar from the actual use case.

So let's go with a more typical virtualization environment such as KVM or VirtualBox, and Vagrant as a layer above them to automate the networking and initial provisioning. For more on this approach, see the next section.

Continuous Delivery on your Laptop

My development setup looks like this: I have the GoCD server installed on my laptop running under Ubuntu, though running it under Windows or MacOS would certainly also work.

Then I have Vagrant installed, using the VirtualBox backend. I configure it to run three VMs for me: one for the GoCD agent, and one each as a testing and production machine. Finally there's an Ansible playbook that configures the three latter machines.

While running the Ansible playbook for configuring these three virtual machines requires internet connectivity, developing and testing the Continuous Delivery process does not.

If you want to use the same test setup, consider using the files from the playground directory of the deployment-utils repository, which will likely be kept more up-to-date than this blog post.

Network and Vagrant Setup

We'll use Vagrant with a private network, which allows you to talk to each of the virtual machines from your laptop or workstation, and vice versa.

I've added these lines to my /etc/hosts file. This isn't strictly necessary, but it makes it easier to talk to the VMs:

# Vagrant
172.28.128.1 go-server.local
172.28.128.3 testing.local
172.28.128.4 production.local
172.28.128.5 go-agent.local

And a few lines to my ~/.ssh/config file:

Host 172.28.128.* *.local
    User root
    StrictHostKeyChecking no
    IdentityFile /dev/null
    LogLevel ERROR

Do not do this for production machines. This is only safe on a virtual network on a single machine, where you can be sure that no attacker is present, unless they already compromised your machine.

That said, creating and destroying VMs is common in Vagrant land, and each time you create them anew, they will have new host keys. Without such a configuration, you'd spend a lot of time updating SSH key fingerprints.

Then let's get Vagrant:

$ apt-get install -y vagrant virtualbox

To configure Vagrant, you need a Ruby script called Vagrantfile:

# -*- mode: ruby -*-
# vi: set ft=ruby :

Vagrant.configure(2) do |config|
  config.vm.box = "debian/contrib-jessie64"

  {
    'testing'    => "172.28.128.3",
    'production' => "172.28.128.4",
    'go-agent'   => "172.28.128.5",
  }.each do |name, ip|
    config.vm.define name do |instance|
        instance.vm.network "private_network", ip: ip
        instance.vm.hostname = name + '.local'
    end
  end

  config.vm.synced_folder '/datadisk/git', '/datadisk/git'

  config.vm.provision "shell" do |s|
    ssh_pub_key = File.readlines("#{Dir.home}/.ssh/id_rsa.pub").first.strip
    s.inline = <<-SHELL
      mkdir -p /root/.ssh
      echo #{ssh_pub_key} >> /root/.ssh/authorized_keys
    SHELL
  end
end

This builds three Vagrant VMs based on the debian/contrib-jessie64 box, which is mostly a pristine Debian Jessie VM, but also includes a file system driver that allows Vagrant to make directories from the host system available to the guest system.

I have a local directory /datadisk/git in which I keep a mirror of my git repositories, so that both the GoCD server and agent can access the git repositories without requiring internet access, and without needing another layer of authentication. The config.vm.synced_folder call in the Vagrant file above replicates this folder into the guest machines.

Finally the code reads an SSH public key from the file ~/.ssh/id_rsa.pub and adds it to the root account on the guest machines. In the next step, an Ansible playbook will use this access to configure the VMs to make them ready for the delivery pipeline.

To spin up the VMs, type

$ vagrant up

in the folder containing the Vagrantfile. The first time you run this, it takes a bit longer because Vagrant needs to download the base image first.

Once that's finished, you can call the command vagrant status to see if everything works; it should look like this:

$ vagrant status
Current machine states:

testing                   running (virtualbox)
production                running (virtualbox)
go-agent                  running (virtualbox)

This environment represents multiple VMs. The VMs are all listed
above with their current state. For more information about a specific
VM, run `vagrant status NAME`.

And (on Debian-based Linux systems) you should be able to see the newly created, private network:

$ ip route | grep vboxnet
172.28.128.0/24 dev vboxnet1  proto kernel  scope link  src 172.28.128.1

You should now be able to log in to the VMs with ssh root@go-agent.local, and the same with testing.local and production.local as host names.

Ansible Configuration for the VMs

It's time to configure the Vagrant VMs. Here's an Ansible playbook that does this:

---
 - hosts: go-agent
   vars:
     go_server: 172.28.128.1
   tasks:
   - group: name=go system=yes
   - name: Make sure the go user has an SSH key
     user: name=go system=yes group=go generate_ssh_key=yes home=/var/go
   - name: Fetch the ssh public key, so we can later distribute it.
     fetch: src=/var/go/.ssh/id_rsa.pub dest=go-rsa.pub fail_on_missing=yes flat=yes
   - apt: package=apt-transport-https state=installed
   - apt_key: url=https://download.go.cd/GOCD-GPG-KEY.asc state=present validate_certs=no
   - apt_repository: repo='deb https://download.go.cd /' state=present
   - apt: update_cache=yes package={{item}} state=installed
     with_items:
      - go-agent
      - git

   - copy:
       src: files/guid.txt
       dest: /var/lib/go-agent/config/guid.txt
       owner: go
       group: go
   - lineinfile: dest=/etc/default/go-agent regexp=^GO_SERVER= line=GO_SERVER={{ go_server }}
   - service: name=go-agent enabled=yes state=started

 - hosts: aptly
   handlers:
    - name: restart lighttpd
      service: name=lighttpd state=restarted
   tasks:
     - apt: package={{item}} state=installed
       with_items:
        - ansible
        - aptly
        - build-essential
        - curl
        - devscripts
        - dh-systemd
        - dh-virtualenv
        - gnupg2
        - libjson-perl
        - python-setuptools
        - lighttpd
        - rng-tools
     - copy: src=files/key-control-file dest=/var/go/key-control-file
     - command: killall rngd
       ignore_errors: yes
       changed_when: False
     - command: rngd -r /dev/urandom
       changed_when: False
     - command: gpg --gen-key --batch /var/go/key-control-file
       args:
         creates: /var/go/.gnupg/pubring.gpg
       become_user: go
       become: true
       changed_when: False
     - shell: gpg --export --armor > /var/go/pubring.asc
       args:
         creates: /var/go/pubring.asc
       become_user: go
       become: true
     - fetch:
         src: /var/go/pubring.asc
          dest: deb-key.asc
         fail_on_missing: yes
         flat: yes
     - name: Bootstrap the aptly repos that will be configured on the `target` machines
       copy:
        src: ../add-package
        dest: /usr/local/bin/add-package
        mode: 0755
     - name: Download an example package to fill the repo with
       get_url:
        url: http://ftp.de.debian.org/debian/pool/main/b/bash/bash_4.3-11+b1_amd64.deb
        dest: /tmp/bash_4.3-11+b1_amd64.deb
     - command: /usr/local/bin/add-package {{item}} jessie /tmp/bash_4.3-11+b1_amd64.deb
       args:
           creates: /var/go/aptly/{{ item }}-jessie.conf
       with_items:
         - testing
         - production
       become_user: go
       become: true

     - name: Configure lighttpd to serve the aptly directories
       copy: src=files/lighttpd.conf dest=/etc/lighttpd/conf-enabled/30-aptly.conf
       notify:
         - restart lighttpd
     - service: name=lighttpd state=started enabled=yes

 - hosts: target
   tasks:
     - authorized_key:
        user: root
        key: "{{ lookup('file', 'go-rsa.pub') }}"
     - apt_key: data="{{ lookup('file', 'deb-key.asc') }}" state=present

 - hosts: production
   tasks:
     - apt_repository:
         repo: "deb http://{{hostvars['agent.local']['ansible_ssh_host'] }}/debian/production/jessie jessie main"
         state: present

 - hosts: testing
   tasks:
     - apt_repository:
         repo: "deb http://{{hostvars['agent.local']['ansible_ssh_host'] }}/debian/testing/jessie jessie main"
         state: present

 - hosts: go-agent
   tasks:
     - name: 'Checking SSH connectivity to {{item}}'
       become: True
       become_user: go
       command: ssh -o StrictHostkeyChecking=No root@"{{ hostvars[item]['ansible_ssh_host'] }}" true
       changed_when: false
       with_items: groups['target']

You also need a hosts or inventory file:

[all:vars]
ansible_ssh_user=root

[go-agent]
agent.local ansible_ssh_host=172.28.128.5

[aptly]
agent.local

[target]
testing.local ansible_ssh_host=172.28.128.3
production.local ansible_ssh_host=172.28.128.4

[testing]
testing.local

[production]
production.local

... and a small ansible.cfg file:

[defaults]
host_key_checking = False
inventory = hosts
pipelining=True

This does a whole lot of stuff:

  • Installs and configures the GoCD agent
    • copies a file with a fixed GUID to the configuration directory of the go-agent, so that when you tear down the machine and create it anew, the GoCD server will identify it as the same agent as before.
  • Gives the go user on the go-agent machine SSH access to the target hosts by
    • first making sure the go user has an SSH key
    • copying the public SSH key to the host machine
    • later distributing it to the target machines using the authorized_key module
  • Creates a GPG key pair for the go user
    • since GPG key creation needs lots of entropy for random numbers, and VMs typically don't have much of that, it first installs rng-tools and uses that to convince the system to use lower-quality randomness. Again, this is something you shouldn't do in a production setting.
  • Copies the public key of said GPG key pair to the host machine, and then distributes it to the target machines using the apt_key module
  • Creates some aptly-based Debian repositories on the go-agent machine by
    • copying the add-package script from the same repository to the go-agent machine
    • running it with a dummy package, here bash, to actually create the repos
    • installing and configuring lighttpd to serve these packages by HTTP
    • configuring the target machines to use these repositories as a package source
  • Checks that the go user on the go-agent machine can indeed reach the other VMs via SSH

After running ansible-playbook setup.yml, your local GoCD server should have a new agent, which you have to activate in the web configuration and to which you have to assign the appropriate resources (debian-jessie and aptly, if you follow the examples from this blog series).

Now when you clone your git repos to /datadisk/git/ (be sure to git clone --mirror) and configure the pipelines on the GoCD server, you have a complete Continuous Delivery system running on one physical machine.
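For illustration, the mirror clones might be created like this (the first URL is the package-info example repository used throughout this series; substitute your own projects):

$ cd /datadisk/git
$ git clone --mirror https://github.com/moritz/package-info.git
$ git clone --mirror https://github.com/moritz/deployment-utils.git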


I'm writing a book on automating deployments. If this topic interests you, please sign up for the Automating Deployments newsletter. It will keep you informed about automating and continuous deployments. It also helps me to gauge interest in this project, and your feedback can shape the course it takes.


Dave's Free Press: Journal: Graphing tool

Dave's Free Press: Journal: Travelling in time: the CP2000AN

Dave's Free Press: Journal: XML::Tiny released

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 1

Ocean of Awareness: Parsing: an expanded timeline

The fourth century BCE: In India, Panini creates a sophisticated description of the Sanskrit language, exact and complete, and including pronunciation. Sanskrit could be recreated using nothing but Panini's grammar. Panini's grammar is probably the first formal system of any kind, predating Euclid. Even today, nothing like it exists for any other natural language of comparable size or corpus. Panini is the object of serious study today. But in the 1940's and 1950's Panini is almost unknown in the West. His work has no direct effect on the other events in this timeline.

1943: Emil Post defines and studies a formal rewriting system using productions. With this, the process of reinventing Pannini in the West begins.

1948: Claude Shannon publishes the foundation paper of information theory. Andrey Markov's finite state processes are used heavily.

1952: Grace Hopper writes a linker-loader and describes it as a "compiler". She seems to be the first person to use this term for a computer program. Hopper uses the term "compiler" in its original sense: "something or someone that brings other things together".

1954: At IBM, a team under John Backus begins working on the language which will be called FORTRAN. The term "compiler" is still being used in Hopper's looser sense, instead of its modern one. In particular, there is no implication that the output of a "compiler" is ready for execution by a computer. The output of one 1954 "compiler", for example, produces relative addresses, which need to be translated by hand before a machine can execute them.

1955: Noam Chomsky is awarded a Ph.D. in linguistics and accepts a teaching post at MIT. MIT does not have a linguistics department and Chomsky, in his linguistics course, is free to teach his own approach, highly original and very mathematical.

1956: Chomsky publishes the paper which is usually considered the foundation of Western formal language theory. The paper advocates a natural language approach that involves

  • a bottom layer, using Markov's finite state processes;
  • a middle, syntactic layer, using context-free grammars and context-sensitive grammars; and
  • a top layer, which involves mappings or "transformations" of the output of the syntactic layer.

These layers resemble, and will inspire, the lexical, syntactic and AST transformation phases of modern parsers. For finite state processes, Chomsky acknowledges Markov. The other layers seem to be Chomsky's own formulations -- Chomsky does not cite Post's work.

1957: Stephen Kleene discovers regular expressions, a very handy notation for Markov's processes. Regular expressions turn out to describe exactly the mathematical objects being studied as finite state automata, as well as some of the objects being studied as neural nets.

1957: Noam Chomsky publishes Syntactic Structures, one of the most influential books of all time. The orthodoxy in 1957 is structural linguistics which argues, with Sherlock Holmes, that "it is a capital mistake to theorize in advance of the facts". Structuralists start with the utterances in a language, and build upward.

But Chomsky claims that without a theory there are no facts: there is only noise. The Chomskyan approach is to start with a grammar, and use the corpus of the language to check its accuracy. Chomsky's approach will soon come to dominate linguistics.

1957: Backus's team makes the first FORTRAN compiler available to IBM customers. FORTRAN is the first high-level language that will find widespread implementation. As of this writing, it is the oldest language that survives in practical use. FORTRAN is a line-by-line language and its parsing is primitive.

1958: John McCarthy's LISP appears. LISP goes beyond the line-by-line syntax -- it is recursively structured. But the LISP interpreter does not find the recursive structure: the programmer must explicitly indicate the structure herself, using parentheses.

1959: Backus invents a new notation to describe the IAL language (aka ALGOL). Backus's notation is influenced by his study of Post -- he seems not to have read Chomsky until later.

1960: Peter Naur improves the Backus notation and uses it to describe ALGOL 60. The improved notation will become known as Backus-Naur Form (BNF).

1960: The ALGOL 60 report specifies, for the first time, a block structured language. ALGOL 60 is recursively structured but the structure is implicit -- newlines are not semantically significant, and parentheses indicate syntax only in a few specific cases. The ALGOL compiler will have to find the structure. It is a case of 1960's optimism at its best. As the ALGOL committee is well aware, a parsing algorithm capable of handling ALGOL 60 does not yet exist. But the risk they are taking will soon pay off.

1960: A.E. Glennie publishes his description of a compiler-compiler. Glennie's "universal compiler" is more of a methodology than an implementation -- the compilers must be written by hand. Glennie credits both Chomsky and Backus, and observes that the two notations are "related". He also mentions Post's productions. Glennie may have been the first to use BNF as a description of a procedure instead of as the description of a Chomsky grammar. Glennie points out that the distinction is "important".

Chomskyan BNF and procedural BNF: BNF, when used as a Chomsky grammar, describes a set of strings, and does not describe how to parse strings according to the grammar. BNF notation, if used to describe a procedure, is a set of instructions, to be tried in some order, and used to process a string. Procedural BNF describes a procedure first, and a language only indirectly.

Both procedural and Chomskyan BNF describe languages, but usually not the same language. That is,

  • Suppose D is some BNF description.
  • Let P(D) be D interpreted as a procedure,
  • Let L(P(D)) be the language which the procedure P(D) parses.
  • Let G(D) be D interpreted as a Chomsky grammar.
  • Let L(G(D)) be the language which the grammar G(D) describes.
  • Then, usually, L(P(D)) != L(G(D)).

The pre-Chomskyan approach, using procedural BNF, is far more natural to someone trained as a computer programmer. The parsing problem appears to the programmer in the form of strings to be parsed, exactly the starting point of procedural BNF and pre-Chomsky parsing.

Even when the Chomskyan approach is pointed out, it does not at first seem very attractive. With the pre-Chomskyan approach, the examples of the language more or less naturally lead to a parser. In the Chomskyan approach the programmer has to search for an algorithm to parse strings according to his grammar -- and the search for good algorithms to parse Chomskyan grammars has proved surprisingly long and difficult. Handling semantics is more natural with a Chomskyan approach. But, using captures, semantics can be added to a pre-Chomskyan parser and, with practice, this seems natural enough.

Despite the naturalness of the pre-Chomskyan approach to parsing, we will find that the first fully-described automated parsers are Chomskyan. This is a testimony to Chomsky's influence at the time. We will also see that Chomskyan parsers have been dominant ever since.

1961: In January, Ned Irons publishes a paper describing his ALGOL 60 parser. It is the first paper to fully describe any parser. The Irons algorithm is Chomskyan and top-down with a "left corner" element. The Irons algorithm is general, meaning that it can parse anything written in BNF. It is syntax-driven (aka declarative), meaning that the parser is actually created from the BNF -- the parser does not need to be hand-written.

1961: Peter Lucas publishes the first description of a purely top-down parser. This can be considered to be recursive descent, though in Lucas's paper the algorithm has a syntax-driven implementation, useable only for a restricted class of grammars. Today we think of recursive descent as a methodology for writing parsers by hand. Hand-coded approaches became more popular in the 1960's due to three factors:

  • Memory and CPU were both extremely limited. Hand-coding paid off, even when the gains were small.
  • Non-hand coded top-down parsing, of the kind Lucas's syntax-driven approach allowed, is a very weak parsing technique. It was (and still is) often necessary to go beyond its limits.
  • Top-down parsing is intuitive -- it essentially means calling subroutines. It therefore requires little or no knowledge of parsing theory. This makes it a good fit for hand-coding.

1963: L. Schmidt, Howard Metcalf, and Val Schorre present papers on syntax-directed compilers at a Denver conference.

1964: Schorre publishes a paper on the Meta II "compiler writing language", summarizing the papers of the 1963 conference. Schorre cites both Backus and Chomsky as sources for Meta II's notation. Schorre notes that his parser is "entirely different" from that of Irons 1961 -- in fact it is pre-Chomskyan. Meta II is a template, rather than something that readers can use, but in principle it can be turned into a fully automated compiler-compiler.

1965: Don Knuth invents LR parsing. The LR algorithm is deterministic, Chomskyan and bottom-up, but it is not thought to be practical. Knuth is primarily interested in the mathematics.

1968: Jay Earley invents the algorithm named after him. Like the Irons algorithm, Earley's algorithm is Chomskyan, syntax-driven and fully general. Unlike the Irons algorithm, it does not backtrack. Earley's algorithm is both top-down and bottom-up at once -- it uses dynamic programming and keeps track of the parse in tables. Earley's approach makes a lot of sense and looks very promising indeed, but there are three serious issues:

  • First, there is a bug in the handling of zero-length rules.
  • Second, it is quadratic for right recursions.
  • Third, the bookkeeping required to set up the tables is, by the standards of 1968 hardware, daunting.

1969: Frank DeRemer describes a new variant of Knuth's LR parsing. DeRemer's LALR algorithm requires only a stack and a state table of quite manageable size. LALR looks practical.

1969: Ken Thompson writes the "ed" editor as one of the first components of UNIX. At this point, regular expressions are an esoteric mathematical formalism. Through the "ed" editor and its descendants, regular expressions will become an everyday part of the working programmer's toolkit.

Recognizers: In comparing algorithms, it can be important to keep in mind whether they are recognizers or parsers. A recognizer is a program which takes a string and produces a "yes" or "no" according to whether the string is in the language. Regular expressions are typically used as recognizers. A parser is a program which takes a string and produces a tree reflecting its structure according to a grammar. The algorithm for a compiler clearly must be a parser, not a recognizer. Recognizers can be, to some extent, used as parsers by introducing captures.

1972: Alfred Aho and Jeffrey Ullman publish a two volume textbook summarizing the theory of parsing. This book is still important. It is also distressingly up-to-date -- progress in parsing theory slowed dramatically after 1972. Aho and Ullman describe a straightforward fix to the zero-length rule bug in Earley's original algorithm. Unfortunately, this fix involves adding even more bookkeeping to Earley's.

1972: Under the names TDPL and GTDPL, Aho and Ullman investigate the non-Chomskyan parsers in the Schorre lineage. They note that "it can be quite difficult to determine what language is defined by a TDPL parser". That is, GTDPL parsers do whatever they do, and that whatever is something the programmer in general will not be able to describe. The best a programmer can usually do is to create a test suite and fiddle with the GTDPL description until it passes. Correctness cannot be established in any stronger sense. GTDPL is an extreme form of the old joke that "the code is the documentation" -- with GTDPL nothing documents the language of the parser, not even the code.

GTDPL's obscurity buys nothing in the way of additional parsing power. Like all non-Chomskyan parsers, GTDPL is basically an extremely powerful recognizer. Pressed into service as a parser, it is comparatively weak. As a parser, GTDPL is essentially equivalent to Lucas's 1961 syntax-driven algorithm, which was in turn a restricted form of recursive descent.

At or around this time, rumor has it that the main line of development for GTDPL parsers is classified secret by the US government. GTDPL parsers have the property that even small changes in GTDPL parsers can be very labor-intensive. For some government contractors, GTDPL parsing provides steady work for years to come. Public interest in GTDPL fades.

1975: Bell Labs converts its C compiler from hand-written recursive descent to DeRemer's LALR algorithm.

1977: The first "Dragon book" comes out. This soon-to-be classic textbook is nicknamed after the drawing on the front cover, in which a knight takes on a dragon. Emblazoned on the knight's lance are the letters "LALR". From here on out, to speak lightly of LALR will be to besmirch the escutcheon of parsing theory.

1979: Bell Laboratories releases Version 7 UNIX. V7 includes what is, by far, the most comprehensive, useable and easily available compiler writing toolkit yet developed.

1979: Part of the V7 toolkit is Yet Another Compiler Compiler (YACC). YACC is LALR-powered. Despite its name, YACC is the first compiler-compiler in the modern sense. For some useful languages, the process of going from Chomskyan specification to executable is fully automated. Most practical languages, including the C language and YACC's own input language, still require manual hackery. Nonetheless, after two decades of research, it seems that the parsing problem is solved.

1987: Larry Wall introduces Perl 1. Perl embraces complexity like no previous language. Larry uses YACC and LALR very aggressively -- to my knowledge more aggressively than anyone before or since.

1991: Joop Leo discovers a way of speeding up right recursions in Earley's algorithm. Leo's algorithm is linear for just about every unambiguous grammar of practical interest, and many ambiguous ones as well. In 1991 hardware is six orders of magnitude faster than 1968 hardware, so the issue of bookkeeping overhead has receded in importance. This is a major discovery. When it comes to speed, the game has changed in favor of the Earley algorithm.

But Earley parsing is almost forgotten. Twenty years will pass before anyone writes a practical implementation of Leo's algorithm.

1990's: Earley's is forgotten. So everyone in LALR-land is content, right? Wrong. Far from it, in fact. Users of LALR are making unpleasant discoveries. While LALR automatically generates their parsers, debugging them is so hard they could just as easily write the parser by hand. Once debugged, their LALR parsers are fast for correct inputs. But almost all they tell the users about incorrect inputs is that they are incorrect. In Larry's words, LALR is "fast but stupid".

2000: Larry Wall decides on a radical reimplementation of Perl -- Perl 6. Larry does not even consider using LALR again.

2002: John Aycock and R. Nigel Horspool publish their attempt at a fast, practical Earley's parser. Missing from it is Joop Leo's improvement -- they seem not to be aware of it. Their own speedup is limited in what it achieves and the complications it introduces can be counter-productive at evaluation time. But buried in their paper is a solution to the zero-length rule bug. And this time the solution requires no additional bookkeeping.

2004: Bryan Ford publishes his paper on PEG. Implementers by now are avoiding YACC, and it seems as if there might soon be no syntax-driven algorithms in practical use. Ford fills this gap by repackaging the nearly-forgotten GTDPL. Ford adds packratting, so that PEG is always linear, and provides PEG with an attractive new syntax. But nothing has been done to change the problematic behaviors of GTDPL.

2006: GNU announces that the GCC compiler's parser has been rewritten. For three decades, the industry's flagship C compilers have used LALR as their parser -- proof of the claim that LALR and serious parsing are equivalent. Now, GNU replaces LALR with the technology that it replaced a quarter century earlier: recursive descent.

Today: After five decades of parsing theory, the state of the art seems to be back where it started. We can imagine someone taking Ned Irons's original 1961 algorithm from the first paper ever published describing a parser, and republishing it today. True, he would have to translate its code from the mix of assembler and ALGOL into something more fashionable, say Haskell. But with that change, it might look like a breath of fresh air.

Marpa: an afterword

The recollections of my teachers cover most of this timeline. My own begin around 1970. Very early on, as a graduate student, I became unhappy with the way the field was developing. Earley's algorithm looked interesting, and it was something I returned to on and off.

The original vision of the 1960's was a parser that was

  • efficient,
  • practical,
  • general, and
  • syntax-driven.

By 2010 this vision seemed to have gone the same way as many other 1960's dreams. The rhetoric stayed upbeat, but parsing practice had become a series of increasingly desperate compromises.

But, while nobody was looking for them, the solutions to the problems encountered in the 1960's had appeared in the literature. Aycock and Horspool had solved the zero-length rule bug. Joop Leo had found the speedup for right recursion. And the issue of bookkeeping overhead had pretty much evaporated on its own. Machine operations are now a billion times faster than in 1968, and are probably no longer relevant in any case -- cache misses are now the bottleneck.

The programmers of the 1960's would have been prepared to trust a fully declarative Chomskyan parser. With the experience with LALR in their collective consciousness, modern programmers might be more guarded. As Lincoln said, "Once a cat's been burned, he won't even sit on a cold stove." But I found it straightforward to rearrange the Earley parse engine to allow efficient event-driven handovers between procedural and syntax-driven logic. And Earley tables provide the procedural logic with full knowledge of the state of the parse so far, so that Earley's algorithm is a better platform for hand-written procedural logic than recursive descent.

References, comments, etc.

My implementation of Earley's algorithm is called Marpa. For more about Marpa, there is the semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Perlgeek.de : Automating Deployments: Pipeline Templates in GoCD

In the last few blog posts, you've seen the development of a GoCD pipeline for building a package, uploading it into a repository for a testing environment, installing it in that environment, and then repeating the upload and installation cycle for a production environment.

To recap, this is the XML config for GoCD so far:

<pipeline name="package-info">
  <materials>
    <git url="https://github.com/moritz/package-info.git" dest="package-info" materialName="package-info" />
    <git url="https://github.com/moritz/deployment-utils.git" dest="deployment-utils" materialName="deployment-utils" />
  </materials>
  <stage name="build" cleanWorkingDir="true">
    <jobs>
      <job name="build-deb" timeout="5">
        <tasks>
          <exec command="../deployment-utils/debian-autobuild" workingdir="#{package}" />
        </tasks>
        <artifacts>
          <artifact src="version" />
          <artifact src="package-info*_*" dest="package-info/" />
        </artifacts>
      </job>
    </jobs>
  </stage>
  <stage name="upload-testing">
    <jobs>
      <job name="upload-testing">
        <tasks>
          <fetchartifact pipeline="" stage="build" job="build-deb" srcdir="package-info">
            <runif status="passed" />
          </fetchartifact>
          <exec command="/bin/bash">
            <arg>-c</arg>
            <arg>deployment-utils/add-package testing jessie package-info_*.deb</arg>
          </exec>
        </tasks>
        <resources>
          <resource>aptly</resource>
        </resources>
      </job>
    </jobs>
  </stage>
  <stage name="deploy-testing">
    <jobs>
      <job name="deploy-testing">
        <tasks>
          <exec command="ansible" workingdir="deployment-utils/ansible/">
            <arg>--sudo</arg>
            <arg>--inventory-file=testing</arg>
            <arg>web</arg>
            <arg>-m</arg>
            <arg>apt</arg>
            <arg>-a</arg>
            <arg>name=package-info state=latest update_cache=yes</arg>
            <runif status="passed" />
          </exec>
        </tasks>
      </job>
    </jobs>
  </stage>
  <stage name="upload-production">
    <approval type="manual" />
    <jobs>
      <job name="upload-production">
        <tasks>
          <fetchartifact pipeline="" stage="build" job="build-deb" srcdir="package-info">
            <runif status="passed" />
          </fetchartifact>
          <exec command="/bin/bash">
            <arg>-c</arg>
            <arg>deployment-utils/add-package production jessie package-info_*.deb</arg>
          </exec>
        </tasks>
        <resources>
          <resource>aptly</resource>
        </resources>
      </job>
    </jobs>
  </stage>
  <stage name="deploy-production">
    <jobs>
      <job name="deploy-production">
        <tasks>
          <exec command="ansible" workingdir="deployment-utils/ansible/">
            <arg>--sudo</arg>
            <arg>--inventory-file=production</arg>
            <arg>web</arg>
            <arg>-m</arg>
            <arg>apt</arg>
            <arg>-a</arg>
            <arg>name=package-info state=latest update_cache=yes</arg>
            <runif status="passed" />
          </exec>
        </tasks>
      </job>
    </jobs>
  </stage>
</pipeline>

The interesting thing here is that the pipeline isn't very specific to this project. Apart from the package name, the Debian distribution and the group of hosts to deploy to, everything in here can be reused for any software that's Debian packaged.

To make the pipeline more generic, we can define parameters, or params for short:

  <params>
    <param name="distribution">jessie</param>
    <param name="package">package-info</param>
    <param name="target">web</param>
  </params>

And then replace all the occurrences of package-info inside the stages definition with #{package}, and so on:

  <stage name="build" cleanWorkingDir="true">
    <jobs>
      <job name="build-deb" timeout="5">
        <tasks>
          <exec command="../deployment-utils/debian-autobuild" workingdir="#{package}" />
        </tasks>
        <artifacts>
          <artifact src="version" />
          <artifact src="#{package}*_*" dest="#{package}/" />
        </artifacts>
      </job>
    </jobs>
  </stage>
  <stage name="upload-testing">
    <jobs>
      <job name="upload-testing">
        <tasks>
          <fetchartifact pipeline="" stage="build" job="build-deb" srcdir="#{package}">
            <runif status="passed" />
          </fetchartifact>
          <exec command="/bin/bash">
            <arg>-c</arg>
            <arg>deployment-utils/add-package testing #{distribution} #{package}_*.deb</arg>
          </exec>
        </tasks>
        <resources>
          <resource>aptly</resource>
        </resources>
      </job>
    </jobs>
  </stage>
  <stage name="deploy-testing">
    <jobs>
      <job name="deploy-testing">
        <tasks>
          <exec command="ansible" workingdir="deployment-utils/ansible/">
            <arg>--sudo</arg>
            <arg>--inventory-file=testing</arg>
            <arg>#{target}</arg>
            <arg>-m</arg>
            <arg>apt</arg>
            <arg>-a</arg>
            <arg>name=#{package} state=latest update_cache=yes</arg>
            <runif status="passed" />
          </exec>
        </tasks>
      </job>
    </jobs>
  </stage>
  <stage name="upload-production">
    <approval type="manual" />
    <jobs>
      <job name="upload-production">
        <tasks>
          <fetchartifact pipeline="" stage="build" job="build-deb" srcdir="#{package}">
            <runif status="passed" />
          </fetchartifact>
          <exec command="/bin/bash">
            <arg>-c</arg>
            <arg>deployment-utils/add-package production #{distribution} #{package}_*.deb</arg>
          </exec>
        </tasks>
        <resources>
          <resource>aptly</resource>
        </resources>
      </job>
    </jobs>
  </stage>
  <stage name="deploy-production">
    <jobs>
      <job name="deploy-production">
        <tasks>
          <exec command="ansible" workingdir="deployment-utils/ansible/">
            <arg>--sudo</arg>
            <arg>--inventory-file=production</arg>
            <arg>#{target}</arg>
            <arg>-m</arg>
            <arg>apt</arg>
            <arg>-a</arg>
            <arg>name=#{package} state=latest update_cache=yes</arg>
            <runif status="passed" />
          </exec>
        </tasks>
      </job>
    </jobs>
  </stage>

The next step towards generalization is to move the stages to a template. This can either be done again by editing the XML config, or in the web frontend with Admin → Pipelines and then clicking the Extract Template link next to the pipeline called package-info.

Either way, the result in the XML looks like this:

<pipelines group="deployment">
  <pipeline name="package-info" template="debian-base">
    <params>
      <param name="distribution">jessie</param>
      <param name="package">package-info</param>
      <param name="target">web</param>
    </params>
    <materials>
      <git url="https://github.com/moritz/package-info.git" dest="package-info" materialName="package-info" />
      <git url="https://github.com/moritz/deployment-utils.git" dest="deployment-utils" materialName="deployment-utils" />
    </materials>
  </pipeline>
</pipelines>
<templates>
  <pipeline name="debian-base">
      <!-- stages definitions go here -->
  </pipeline>
</templates>

Everything that's specific to this one piece of software is now in the pipeline definition, and the reusable parts are in the template. The sole exception is the deployment-utils repo, which must be added for each software that is being automatically deployed, since GoCD has no way to move a material to a template.

Adding a deployment pipeline for another piece of software is now just a matter of specifying the URL, package name, target (that is, name of a group in the Ansible inventory file) and distribution. So about a minute of work once you're used to the tooling.
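For illustration, a second pipeline reusing the debian-base template might look like this in the XML config (the pipeline name and the first git URL are made up; the params mirror the ones defined above):

<pipeline name="another-app" template="debian-base">
  <params>
    <param name="distribution">jessie</param>
    <param name="package">another-app</param>
    <param name="target">web</param>
  </params>
  <materials>
    <git url="https://github.com/example/another-app.git" dest="another-app" materialName="another-app" />
    <git url="https://github.com/moritz/deployment-utils.git" dest="deployment-utils" materialName="deployment-utils" />
  </materials>
</pipeline>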


I'm writing a book on automating deployments. If this topic interests you, please sign up for the Automating Deployments newsletter. It will keep you informed about automating and continuous deployments. It also helps me to gauge interest in this project, and your feedback can shape the course it takes.


Dave's Free Press: Journal: Thanks, Yahoo!

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 2

Ocean of Awareness: What are the reasonable computer languages?

"You see things; and you say 'Why?' But I dream things that never were; and I say 'Why not?'" -- George Bernard Shaw

In the 1960's and 1970's computer languages were evolving rapidly. It was not clear which way they were headed. Would most programming be done with general-purpose languages? Or would programmers create a language for every task domain? Or even for every project? And, if lots of languages were going to be created, what kinds of languages would be needed?

It was in that context that Čulik and Cohen, in a 1973 paper, outlined what they thought programmers would want and should have. In keeping with the spirit of the time, it was quite a lot:

  • Programmers would want to extend their grammars with new syntax, including new kinds of expressions.
  • Programmers would also want to use tools that automatically generated new syntax.
  • Programmers would not want to, and especially in the case of auto-generated syntax would usually not be able to, massage the syntax into very restricted forms. Instead, programmers would create grammars and languages which required unlimited lookahead to disambiguate, and they would require parsers which could handle these grammars.
  • Finally, programmers would need to be able to rely on all of this parsing being done in linear time.

Today, we think we know that Čulik and Cohen's vision was naive, because we think we know that parsing technology cannot support it. We think we know that parsing is much harder than they thought.

The eyeball grammars

As a thought problem, consider the "eyeball" class of grammars. The "eyeball" class of grammars contains all the grammars that a human can parse at a glance. If a grammar is in the eyeball class, but a computer cannot parse it, it presents an interesting choice. Either,

  • your computer is not using the strongest practical algorithm; or
  • your mind is using some power which cannot be reduced to a machine computation.

There are some people out there (I am one of them) who don't believe that everything the mind can do reduces to a machine computation. But even those people will tend to go for the choice in this case: There must be some practical computer parsing algorithm which can do at least as well at parsing as a human can do by "eyeball". In other words, the class of "reasonable grammars" should contain the eyeball class.

Čulik and Cohen's candidate for the class of "reasonable grammars" were the grammars that a deterministic parse engine could parse if it had a lookahead that was infinite, but restricted to distinguishing between regular expressions. They called these the LR-regular, or LRR, grammars. And the LRR grammars do in fact seem to be a good first approximation to the eyeball class. They do not allow lookahead that contains things that you have to count, like palindromes. And, while I'd be hard put to eyeball every possible string for every possible regular expression, intuitively the concept of scanning for a regular expression does seem close to capturing the idea of glancing through a text looking for a telltale pattern.

So what happened?

Alas, the algorithm in the Čulik and Cohen paper turned out to be impractical. But in 1991, Joop Leo discovered a way to adapt Earley's algorithm to parse the LRR grammars in linear time, without doing the lookahead. And Leo's algorithm does have a practical implementation: Marpa.

References, comments, etc.

To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: YAPC::Europe 2007 travel plans

Perlgeek.de : Automating Deployments: Stage 2: Uploading

Once you have the pipeline for building a package, it's time to distribute the freshly built package to the machines where it's going to be installed.

I've previously explained the nuts and bolts of getting a Debian package into a repository managed by aptly, so it's time to automate that.

Some Assumptions

We are going to need a separate repository for each environment we want to deploy to (or maybe group of environments; it might be OK and even desirable to share a repository between various testing environments that can be used in parallel, for example for security, performance and functional testing).

At some point in the future, when a new version of the operating system is released, we'll also need to build packages for another major version, so for example for Debian stretch instead of jessie. So it's best to plan for that case. Based on these assumptions, the path to each repository will be $HOME/aptly/$environment/$distribution.

For the sake of simplicity, I'm going to assume a single host on which both testing and production repositories will be hosted, in separate directories. If you need those repos on separate servers, it's easy to reverse that decision (or make a different one in the first place).

To ease the transport and management of the repository, a GoCD agent should be running on the repo server. It can copy the packages from the GoCD server's artifact repository with built-in commands.

Scripting the Repository Management

It would be possible to manually initialize each repository, and only automate the process of adding a package. But since it's not hard to do, taking the opposite route of creating repositories automatically on the fly is more reliable. The next time you need a new environment or have to support a new distribution, you will benefit from this decision.

So here is a small Perl program that, given an environment, distribution and a package file name, creates the aptly repo if it doesn't exist yet, writes the config file for the repo, and adds the package.

#!/usr/bin/perl
use strict;
use warnings;
use 5.014;
use JSON qw(encode_json);
use File::Path qw(mkpath);
use autodie;

unless ( @ARGV == 3) {
    die "Usage: $0 <environment> <distribution> <.deb file>\n";
}
my ( $env, $distribution, $package ) = @ARGV;

my $base_path   = "$ENV{HOME}/aptly";
my $repo_path   = "$base_path/$env/$distribution";
my $config_file = "$base_path/$env-$distribution.conf";
my @aptly_cmd   = ("aptly", "-config=$config_file");

init_config();
init_repo();
add_package();


sub init_config {
    mkpath $base_path;
    open my $CONF, '>:encoding(UTF-8)', $config_file;
    say $CONF encode_json( {
        rootDir       => $repo_path,
        architectures => [qw( i386 amd64 all )],
    });
    close $CONF;
}

sub init_repo {
    return if -d "$repo_path/db";
    mkpath $repo_path;
    system @aptly_cmd, "repo", "create", "-distribution=$distribution", "myrepo";
    system @aptly_cmd, "publish", "repo", "myrepo";
}

sub add_package {
    system @aptly_cmd,  "repo", "add", "myrepo", $package;
    system @aptly_cmd,  "publish", "update", $distribution;
}
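
For a quick interactive test, an invocation might look like this (run as the repository user; the .deb file is whatever the build step produced, here following the package-info example):

$ ./add-package testing jessie package-info_*.deb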

As always, I've developed and tested this script interactively, and only started to plug it into the automated pipeline once I was confident that it did what I wanted.

And like all software, it's meant to be under version control, so it's now part of the deployment-utils git repo.

More Preparations: GPG Key

Before GoCD can upload the Debian packages into a repository, the go agent needs to have a GPG key that's not protected by a password. You can either log into the go system user account and create it there with gpg --gen-key, or copy an existing .gnupg directory over to ~go (don't forget to adjust the ownership of the directory and the files in there).
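
A minimal sketch of both options, assuming the agent runs as the system user go with home directory /var/go (the source path in the second variant is purely illustrative):

# option 1: create a fresh key as the go user; leave the passphrase empty
$ sudo -u go -H gpg --gen-key

# option 2: reuse an existing keyring
$ sudo cp -r /path/to/existing/.gnupg /var/go/.gnupg
$ sudo chown -R go:go /var/go/.gnupg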

Integrating the Upload into the Pipeline

The first stage of the pipeline builds the Debian package, and records the resulting file as an artifact. The upload step needs to retrieve this artifact with a fetchartifact task. This is the config for the second stage, to be inserted directly after the first one:

  <stage name="upload-testing">
    <jobs>
      <job name="upload-testing">
        <tasks>
          <fetchartifact pipeline="" stage="build" job="build-deb" srcdir="package-info">
            <runif status="passed" />
          </fetchartifact>
          <exec command="/bin/bash">
            <arg>-c</arg>
            <arg>deployment-utils/add-package testing jessie package-info_*.deb</arg>
          </exec>
        </tasks>
        <resources>
          <resource>aptly</resource>
        </resources>
      </job>
    </jobs>
  </stage>

Note that testing here refers to the name of the environment (which you can choose freely, as long as you are consistent), not the testing distribution of the Debian project.

There is an aptly resource, which you must assign to the agent running on the repo server. If you want separate servers for testing and production repositories, you'd come up with a more specific resource name here (for example aptly-testing) and a separate one for the production repository.

Make the Repository Available through HTTP

To make the repository reachable from other servers, it needs to be exposed to the network. The most convenient way is over HTTP. Since only static files need to be served (and a directory index), pretty much any web server will do.

An example config for lighttpd:

dir-listing.encoding = "utf-8"
server.dir-listing   = "enable"
alias.url = ( 
    "/debian/testing/jessie/"    => "/var/go/aptly/testing/jessie/public/",
    "/debian/production/jessie/" => "/var/go/aptly/production/jessie/public/",
    # more repos here
)

And for the Apache HTTP server, once you've configured a virtual host:

Options +Indexes
Alias /debian/testing/jessie/     /var/go/aptly/testing/jessie/public/
Alias /debian/production/jessie/  /var/go/aptly/production/jessie/public/
# more repos here

Achievement Unlocked: Automatic Build and Distribution

With these steps done, automatic building and uploading of packages is in place. Since client machines can pull from that repository at will, we can tick off the distribution of packages to the client machines.


I'm writing a book on automating deployments. If this topic interests you, please sign up for the Automating Deployments newsletter. It will keep you informed about automating and continuous deployments. It also helps me to gauge interest in this project, and your feedback can shape the course it takes.


Dave's Free Press: Journal: Wikipedia handheld proxy

Dave's Free Press: Journal: Bryar security hole

Dave's Free Press: Journal: POD includes

Perlgeek.de : Automating Deployments: Installation in the Pipeline

As mentioned before (perlgeek.de/blog-en/automating-deployments/2016-007-installing-packages.html), my tool of choice for automating package installation is ansible (https://deploybook.com/resources).

The first step is to create an inventory file for ansible. In a real deployment setting, this would contain the hostnames to deploy to. For the sake of this project I just have a test setup consisting of virtual machines managed by vagrant, which leads to a somewhat unusual ansible configuration.

That's the ansible.cfg:

[defaults]
remote_user = vagrant
host_key_checking = False

And the inventory file called testing for the testing environment:

[web]
testserver ansible_ssh_host=127.0.0.1 ansible_ssh_port=2345 

(The host is localhost here, because I run a vagrant setup to test the pipeline; in a real setting, it would just be the hostname of your test machine.)

All code and configuration goes into version control; I created an ansible directory in the deployment-utils repo and dumped the files there.

Finally I copied the ssh private key (from vagrant ssh-config) to /var/go/.ssh/id_rsa, adjusted the owner to user go, and was ready to go.
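
In shell terms, that amounts to something like this (the source path is whatever IdentityFile line vagrant ssh-config prints for your setup):

$ sudo mkdir -p /var/go/.ssh
$ sudo cp /path/to/private_key /var/go/.ssh/id_rsa
$ sudo chown -R go:go /var/go/.ssh
$ sudo chmod 600 /var/go/.ssh/id_rsa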

Plugging it into GoCD

Automatically installing a newly built package through GoCD in the testing environment is just another stage away:

  <stage name="deploy-testing">
    <jobs>
      <job name="deploy-testing">
        <tasks>
          <exec command="ansible" workingdir="deployment-utils/ansible/">
            <arg>--sudo</arg>
            <arg>--inventory-file=testing</arg>
            <arg>web</arg>
            <arg>-m</arg>
            <arg>apt</arg>
            <arg>-a</arg>
            <arg>name=package-info state=latest update_cache=yes</arg>
            <runif status="passed" />
          </exec>
        </tasks>
      </job>
    </jobs>
  </stage>

The central part is an invocation of ansible in the newly created ansible directory of the deployment-utils repository.

Results

To run the new stage, either trigger a complete run of the pipeline by hitting the "play" triangle in the pipeline overview in the web frontend, or trigger that one stage manually in the pipeline history view.

You can log in on the target machine to check if the package was successfully installed:

vagrant@debian-jessie:~$ dpkg -l package-info
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version      Architecture Description
+++-==============-============-============-=================================
ii  package-info   0.1-0.7.1    all          Web service for getting a list of

and verify that the service is running:

vagrant@debian-jessie:~$ systemctl status package-info
● package-info.service - Package installation information via http
   Loaded: loaded (/lib/systemd/system/package-info.service; static)
   Active: active (running) since Sun 2016-03-27 13:15:41 GMT; 4h 6min ago
  Process: 4439 ExecStop=/usr/bin/hypnotoad -s /usr/lib/package-info/package-info (code=exited, status=0/SUCCESS)
 Main PID: 4442 (/usr/lib/packag)
   CGroup: /system.slice/package-info.service
           ├─4442 /usr/lib/package-info/package-info
           ├─4445 /usr/lib/package-info/package-info
           ├─4446 /usr/lib/package-info/package-info
           ├─4447 /usr/lib/package-info/package-info
           └─4448 /usr/lib/package-info/package-info

and check that it responds on port 8080, as it's supposed to:

    vagrant@debian-jessie:~$ curl http://127.0.0.1:8080/|head -n 7
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
      0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0Desired=Unknown/Install/Remove/Purge/Hold
    | Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
    |/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
    ||/ Name                           Version                     Architecture Description
    +++-==============================-===========================-============-===============================================================================
    ii  acl                            2.2.52-2                    amd64        Access control list utilities
    ii  acpi                           1.7-1                       amd64        displays information on ACPI devices
    curl: (23) Failed writing body (2877 != 16384)

The last line is simply curl complaining that it can't write the full output, due to the pipe to head exiting too early to receive all the contents. We can safely ignore that.

Going All the Way to Production

Uploading and deploying to production works the same as with the testing environment. So all that's needed is to duplicate the configuration of the last two stages, replace every occurrence of testing with production, and add a manual approval step, so that production deployment remains a conscious decision:

  <stage name="upload-production">
    <approval type="manual" />
    <jobs>
      <job name="upload-production">
        <tasks>
          <fetchartifact pipeline="" stage="build" job="build-deb" srcdir="package-info">
            <runif status="passed" />
          </fetchartifact>
          <exec command="/bin/bash">
            <arg>-c</arg>
            <arg>deployment-utils/add-package production jessie package-info_*.deb</arg>
          </exec>
        </tasks>
        <resources>
          <resource>aptly</resource>
        </resources>
      </job>
    </jobs>
  </stage>
  <stage name="deploy-production">
    <jobs>
      <job name="deploy-production">
        <tasks>
          <exec command="ansible" workingdir="deployment-utils/ansible/">
            <arg>--sudo</arg>
            <arg>--inventory-file=production</arg>
            <arg>web</arg>
            <arg>-m</arg>
            <arg>apt</arg>
            <arg>-a</arg>
            <arg>name=package-info state=latest update_cache=yes</arg>
            <runif status="passed" />
          </exec>
        </tasks>
      </job>
    </jobs>
  </stage>

The only real news here is the second line:

    <approval type="manual" />

which makes GoCD only proceed to this stage when somebody clicks the approval arrow in the web interface.

You also need to fill out the inventory file called production with the list of your server or servers.

Achievement Unlocked: Basic Continuous Delivery

Let's recap: the pipeline

  • is triggered automatically from commits in the source code
  • automatically builds a Debian package from each commit
  • uploads it to a repository for the testing environment
  • automatically installs it in the testing environment
  • upon manual approval, uploads it to a repository for the production environment
  • ... and automatically installs the new version in production.

So the basic framework for Continuous Delivery is in place.

Wow, that escalated quickly.

Missing Pieces

Of course, there's lots to be done before we can call this a fully-fledged Continuous Delivery pipeline:

  • Automatic testing
  • Generalization to other software
  • Version pinning (always installing the correct version, not the newest one)
  • Rollbacks
  • Data migration

But even as is, the pipeline can provide quite some time savings and shortened feedback cycles. The manual approval before production deployment is a good hook for manual tasks, such as manual tests.


I'm writing a book on automating deployments. If this topic interests you, please sign up for the Automating Deployments newsletter. It will keep you informed about automating and continuous deployments. It also helps me to gauge interest in this project, and your feedback can shape the course it takes.


Dave's Free Press: Journal: cgit syntax highlighting

Perlgeek.de : Ansible: A Primer

Ansible is a very pragmatic and powerful configuration management system that is easy to get started with.

Connections and Inventory

Ansible is typically used to connect to one or more remote hosts via ssh and bring them into a desired state. The connection method is pluggable: other methods include local, which simply invokes the commands on the local host instead, and docker, which connects through the Docker daemon to configure a running container.

To tell Ansible where and how to connect, you write an inventory file, called hosts by default. In the inventory file, you can define hosts and groups of hosts, and also set variables that control how to connect to them.

# file myinventory
# example inventory file
[all:vars]
# variables set here apply to all hosts
ansible_user=root

[web]
# a group of webservers
www01.example.com
www02.example.com

[app]
# a group of 5 application servers,
# all following the same naming scheme:
app[01:05].example.com

[frontend:children]
# a group that combines the two previous groups
app
web

[database]
# here we override ansible_user for just one host
db01.example.com ansible_user=postgres

(In versions prior to Ansible 2.0, you have to use ansible_ssh_user instead of ansible_user). See the introduction to inventory files for more information.

To test the connection, you can use the ping module on the command line:

$ ansible -i myinventory web -m ping
www01.example.com | success >> {
    "changed": false,
    "ping": "pong"
}

www02.example.com | success >> {
    "changed": false,
    "ping": "pong"
}

Let's break the command line down into its components: -i myinventory tells Ansible to use the myinventory file as inventory. web tells Ansible which hosts to work on. It can be a group, as in this example, or a single host, or several such things separated by a colon. For example, www01.example.com:database would select one of the web servers and all of the database servers. Finally, -m ping tells Ansible which module to execute. ping is probably the simplest module: it simply sends the response "pong" and reports that the remote host hasn't changed.

These commands run in parallel on the different hosts, so the order in which these responses are printed can vary.

If there is a problem with connecting to a host, add the option -vvv to get more output.

Ansible implicitly gives you the group all which -- you guessed it -- contains all the hosts configured in the inventory file.
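
For instance, both a combined pattern and the implicit all group can be used wherever a group name is expected (a sketch; output omitted):

$ ansible -i myinventory 'www01.example.com:database' -m ping
$ ansible -i myinventory all -m ping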

Modules

Whenever you want to do something on a host through Ansible, you invoke a module to do that. Modules usually take arguments that specify what exactly should happen. On the command line, you can add those arguments with ansible -m module -a 'arguments', for example:

$ ansible -i myinventory database -m shell -a 'echo "hi there"'
db01.example.com | success | rc=0 >>
hi there

Ansible comes with a wealth of built-in modules and an ecosystem of third-party modules as well. Here I want to present just a few commonly used modules.

The shell Module

The shell module executes a shell command on the host and accepts some options such as chdir to change into another working directory first:

$ ansible -i myinventory database -m shell -a 'pwd chdir=/tmp'
db01.example.com | success | rc=0 >>
/tmp

It is pretty generic, but also an option of last resort. If there is a more specific module for the task at hand, you should prefer the more specific module. For example you could ensure that system users exist using the shell module, but the more specialized user module is much easier to use for that, and likely does a better job than an improvised shell script.
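
As a sketch of that comparison (the user name deploy is made up for illustration), the specialized call is as simple as:

$ ansible -i myinventory database -m user -a 'name=deploy state=present'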

The copy Module

With copy you can copy files verbatim from the local to the remote machine:

$ ansible -i myinventory database -m copy -a 'src=README.md dest=/etc/motd mode=644'
db01.example.com | success >> {
    "changed": true,
    "dest": "/etc/motd",
    "gid": 0,
    "group": "root",
    "md5sum": "d41d8cd98f00b204e9800998ecf8427e",
    "mode": "0644",
    "owner": "root",
    "size": 0,
    "src": "/root/.ansible/tmp/ansible-tmp-1467144445.16-156283272674661/source",
    "state": "file",
    "uid": 0
}

The template Module

template mostly works like copy, but it interprets the source file as a Jinja2 template before transferring it to the remote host.

This is commonly used to create configuration files and to incorporate information from variables (more on that later).

Templates cannot be used directly from the command line, but rather in playbooks, so here is an example of a simple playbook.

# file motd.j2
This machine is managed by {{team}}.


# file template-example.yml
---
- hosts: all
  vars:
    team: Slackers
  tasks:
   - template: src=motd.j2 dest=/etc/motd mode=0644

More on playbooks later, but what you can see is that this defines a variable team, sets it to the value Slackers, and the template interpolates this variable.

When you run the playbook with

$ ansible-playbook -i myinventory --limit database template-example.yml

it creates a file /etc/motd on the database server with the contents

This machine is managed by Slackers.

The file Module

The file module manages attributes of files, such as permissions, but also allows you to create directories, and soft and hard links.

$ ansible -i myinventory database -m file -a 'path=/etc/apt/sources.list.d state=directory mode=0755'
db01.example.com | success >> {
    "changed": false,
    "gid": 0,
    "group": "root",
    "mode": "0755",
    "owner": "root",
    "path": "/etc/apt/sources.list.d",
    "size": 4096,
    "state": "directory",
    "uid": 0
}

The apt Module

On Debian and derived distributions, such as Ubuntu, installing and removing packages is generally done with package managers from the apt family, such as apt-get, aptitude, and in newer versions, the apt binary directly.

The apt module manages this from within Ansible:

$ ansible -i myinventory database -m apt -a 'name=screen state=installed update_cache=yes'
db01.example.com | success >> {
    "changed": false
}

Here the screen package was already installed, so the module didn't change the state of the system.

Separate modules are available for managing apt-keys with which repositories are cryptographically verified, and for managing the repositories themselves.
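
Ad-hoc calls to those modules take the same key=value arguments as the playbook tasks shown further below; here is a sketch using the GoCD repository from this series as the example:

$ ansible -i myinventory database -m apt_key -a 'url=https://download.go.cd/GOCD-GPG-KEY.asc state=present'
$ ansible -i myinventory database -m apt_repository -a "repo='deb https://download.go.cd /' state=present"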

The yum and zypper Modules

For RPM-based Linux distributions, the yum module (core) and zypper module (not in core, so must be installed separately) are available. They manage package installation via the package managers of the same name.

The package Module

The package module tries to use whatever package manager it detects. It is thus more generic than the apt and yum modules, but supports far fewer features. For example in the case of apt, it does not provide any control over whether to run apt-get update before doing anything else.
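
A minimal ad-hoc invocation might look like this (again with screen as the example package):

$ ansible -i myinventory database -m package -a 'name=screen state=present'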

Application-Specific Modules

The modules presented so far are fairly close to the system, but there are also modules for achieving common, application-specific tasks. Examples include dealing with databases, network-related services such as proxies, version control systems, clustering solutions such as Kubernetes, and so on.

Playbooks

Playbooks can contain multiple calls to modules in a defined order and limit their execution to individual hosts or groups of hosts.

They are written in the YAML file format, a data serialization file format that is optimized for human readability.

Here is an example playbook that installs the newest version of the go-agent Debian package, the worker for Go Continuous Delivery:

---
- hosts: go-agent
  vars:
    go_server: hack.p6c.org
  tasks:
  - apt: package=apt-transport-https state=installed
  - apt_key: url=https://download.go.cd/GOCD-GPG-KEY.asc state=present validate_certs=no
  - apt_repository: repo='deb https://download.go.cd /' state=present
  - apt: update_cache=yes package={{item}} state=installed
    with_items:
     - go-agent
     - git
     - build-essential
  - lineinfile: dest=/etc/default/go-agent regexp=^GO_SERVER= line=GO_SERVER={{ go_server }}
  - service: name=go-agent enabled=yes state=started

The top level element in this file is a one-element list. The single element starts with hosts: go-agent, which limits execution to hosts in the group go-agent. This is the relevant part of the inventory file that goes with it:

[go-agent]
go-worker01.p6c.org
go-worker02.p6c.org

Then it sets the variable go_server to a string, in this case the hostname where a GoCD server runs.

Finally, the meat of the playbook: the list of tasks to execute.

Each task is a call to a module, some of which have already been discussed. A quick overview:

  • First, the Debian package apt-transport-https is installed, to make sure that the system can fetch metadata and files from Debian repositories through HTTPS
  • The next two tasks use the apt_repository and apt_key modules to configure the repository from which the actual go-agent package shall be installed
  • Another call to apt installs the desired go-agent package, along with some other packages, using a loop construct
  • The lineinfile module searches by regex for a line in a text file, and replaces the appropriate line with pre-defined content. Here we use that to configure the GoCD server that the agent connects to.
  • Finally, the service module starts the agent if it's not yet running (state=started), and ensures that it is automatically started on reboot (enabled=yes).

Playbooks are invoked with the ansible-playbook command.
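
Assuming the playbook above is saved as go-agent.yml, that could look like this; the optional --check flag makes Ansible report what it would change without actually changing anything:

$ ansible-playbook -i myinventory go-agent.yml
$ ansible-playbook -i myinventory --check go-agent.yml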

There can be more than one list of tasks in a playbook, which is a common use-case when they affect different groups of hosts:

---
- hosts: go-agent:go-server
  tasks:
  - apt: package=apt-transport-https state=installed
  - apt_key: url=https://download.go.cd/GOCD-GPG-KEY.asc state=present validate_certs=no
  - apt_repository: repo='deb https://download.go.cd /' state=present

- hosts: go-agent
  tasks:
  - apt: update_cache=yes package={{item}} state=installed
    with_items:
     - go-agent
     - git
     - build-essential
  - ...

- hosts: go-server
  tasks:
  - apt: update_cache=yes package=go-server state=installed
  - ...

Variables

Variables are useful both for controlling flow inside a playbook, and for filling out spots in templates to generate configuration files.

There are several ways to set variables. One is directly in playbooks, via vars: ..., as seen before. Another is to specify them at the command line:

ansible-playbook --extra-vars=variable=value theplaybook.yml

Another, very flexible way is to use the group_vars feature. For each group that a host is in, Ansible looks for a file group_vars/thegroup.yml and for files matching group_vars/thegroup/*.yml. A host can be in several groups at once, which gives you quite some flexibility.

For example, you can put each host into two groups, one for the role the host is playing (like webserver, database server, DNS server etc.), and one for the environment it is in (test, staging, prod). Here is a small example that uses this layout:

# environments
[prod]
www[01:02].example.com
db01.example.com

[test]
db01.test.example.com
www01.test.example.com


# functional roles
[web]
www[01:02].example.com
www01.test.example.com

[db]
db01.example.com
db01.test.example.com

To roll out only the test hosts, you can run

ansible-playbook --limit test theplaybook.yml

and put environment-specific variables in group_vars/test.yml and group_vars/prod.yml, and web server specific variables in group_vars/web.yml etc.
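
With the inventory above, the variable files might be laid out like this (just a sketch; which files exist depends on which groups you actually want to customize):

$ ls group_vars/
all.yml  db.yml  prod.yml  test.yml  web.yml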

You can use nested data structures in your variables, and if you do, you can configure Ansible to merge those data structures for you. To do so, create a file called ansible.cfg with this content:

[defaults]
hash_behaviour=merge

That way, you can have a file group_vars/all.yml that sets the default values:

# file group_vars/all.yml
myapp:
    domain: example.com
    db:
        host: db.example.com
        username: myappuser
        instance: myapp

And then override individual elements of that nested data structure, for example in group_vars/test.yml:

# file group_vars/test.yml
myapp:
    domain: test.example.com
    db:
        host: db.test.example.com

The keys that the test group vars file didn't touch, for example myapp.db.username, are inherited from the file all.yml.

Roles

Roles are a way to encapsulate parts of a playbook into a reusable component.

Let's consider a real world example that leads to a simple role definition.

For deploying software, you always want to deploy the exact version that you just built, so the relevant part of the playbook is

- apt: name=thepackage={{package_version}} state=present update_cache=yes force=yes

But this requires you to supply the package_version variable whenever you run the playbook, which is not practical when you set up a new machine and need to install several software packages, each with its own playbook.

Hence, we generalize the code to deal with the case that the version number is absent:

- apt: name=thepackage={{package_version}} state=present update_cache=yes force=yes
  when: package_version is defined
- apt: name=thepackage state=present update_cache=yes
  when: package_version is undefined

If you run several such playbooks on the same host, you'll notice that it likely spends most of its time running apt-get update for each playbook. This is necessary the first time, because you might have just uploaded a new package on your local Debian mirror prior to the deployment, but subsequent runs are unnecessary. So you can store the information that a host has already updated its cache in a fact, which is a per-host kind of variable in Ansible.

- apt: update_cache=yes
  when: apt_cache_updated is undefined

- set_fact:
    apt_cache_updated: true

As you can see, the code base for sensibly installing a package has grown a bit, and it's time to factor it out into a role.

Roles are collections of YAML files, with pre-defined names. The commands

$ mkdir roles
$ cd roles
$ ansible-galaxy init custom_package_installation

create an empty skeleton for a role named custom_package_installation. The tasks that previously went into all the playbooks now go into the file tasks/main.yml below the role's main directory:

# file roles/custom_package_installation/tasks/main.yml
- apt: update_cache=yes
  when: apt_cache_updated is undefined
- set_fact:
    apt_cache_updated: true

- apt: name={{package}}={{package_version}} state=present update_cache=yes force=yes
  when: package_version is defined
- apt: name={{package}} state=present update_cache=yes
  when: package_version is undefined

To use the role, first add the line roles_path = roles to the [defaults] section of the file ansible.cfg, and then include it in a playbook like this:

---
- hosts: web
  pre_tasks:
     - # tasks that are executed before the role(s)
  roles:
    - { role: custom_package_installation, package: python-matheval }
  tasks:
    - # tasks that are executed after the role(s)

pre_tasks and tasks are optional; a playbook consisting of only roles being included is totally fine.

Summary

Ansible offers a pragmatic approach to configuration management, and is easy to get started with.

It offers modules for low-level tasks such as transferring files and executing shell commands, but also higher-level tasks like managing packages and system users, and even application-specific tasks such as managing PostgreSQL and MySQL users.

Playbooks can contain multiple calls to modules, and also use and set variables and consume roles.

Ansible has many more features, like handlers, which allow you to restart services only once after any changes, dynamic inventories for more flexible server landscapes, vault for encrypting variables, and a rich ecosystem of existing roles for managing common applications and middleware.

For learning more about Ansible, I highly recommend the excellent book Ansible: Up and Running by Lorin Hochstein.


I'm writing a book on automating deployments. If this topic interests you, please sign up for the Automating Deployments newsletter. It will keep you informed about automating and continuous deployments. It also helps me to gauge interest in this project, and your feedback can shape the course it takes.


Perlgeek.de : Automatically Deploying Specific Versions

Versions. Always talking about versions. So, more talk about versions.

The installation pipeline from a previous installment always installs the newest version available. In a normal, simple, linear development flow, this is fine, and even in other workflows, it's a good first step.

But we really want the pipeline to deploy the exact version that was built inside the same instance of the pipeline. The obvious benefit is that it allows you to rerun older versions of the pipeline to install older versions, effectively giving you a rollback.

Or you can build a second pipeline for hotfixes, based on the same git repository but a different branch, and when you do want a hotfix, you simply pause the regular pipeline, and trigger the hotfix pipeline. In this scenario, if you always installed the newest version, finding a proper version string for the hotfix is nearly impossible, because it needs to be higher than the currently installed one, but also lower than the next regular build. Oh, and all of that automatically please.

A less obvious benefit of installing a very specific version is that it detects errors in the package source configuration of the target machines. If the deployment script just installs the newest version that's available, and through an error the repository isn't configured on the target machine, the installation process becomes a silent no-op if the package is already installed in an older version.

Implementation

There are two things to do: figure out which version of the package to install, and then do it.

The latter step is fairly easy, because the Ansible "apt" module that I use for installation supports specifying a version, and the documentation even has an example:

# Install the version '1.00' of package "foo"
- apt: name=foo=1.00 state=present

Experimenting with this feature shows that in case this is a downgrade, you also need to add force=yes.
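
An ad-hoc downgrade to a hypothetical version 0.99 would then look something like this:

$ ansible --sudo --inventory-file=testing web -m apt -a 'name=foo=0.99 state=present force=yes'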

Determining the version number to install also has a simple, though maybe not obvious, solution: write the version number to a file, collect this file as an artifact in GoCD, and then when it's time to install, fetch the artifact, and read the version number from it.

When I last talked about the build step, I silently introduced configuration that collects the version file that the debian-autobuild script writes:

  <job name="build-deb" timeout="5">
    <tasks>
      <exec command="../deployment-utils/debian-autobuild" workingdir="#{package}" />
    </tasks>
    <artifacts>
      <artifact src="version" />
      <artifact src="package-info*_*" dest="package-info/" />
    </artifacts>
  </job>

So only the actual installation step needs adjusting. This is what the configuration looked like:

  <job name="deploy-testing">
    <tasks>
      <exec command="ansible" workingdir="deployment-utils/ansible/">
        <arg>--sudo</arg>
        <arg>--inventory-file=testing</arg>
        <arg>web</arg>
        <arg>-m</arg>
        <arg>apt</arg>
        <arg>-a</arg>
        <arg>name=package-info state=latest update_cache=yes</arg>
        <runif status="passed" />
      </exec>
    </tasks>
  </job>

So, first fetch the version file:

      <job name="deploy-testing">
        <tasks>
          <fetchartifact pipeline="" stage="build" job="build-deb" srcfile="version" />
          ...

Then, how do you get the version from the file to ansible? One could either use ansible's lookup('file', path) function, or write a small script. I decided to do the latter, since I was originally more familiar with bash's capabilities than with ansible's, and it's only a one-liner anyway:

          ...
          <exec command="/bin/bash" workingdir="deployment-utils/ansible/">
            <arg>-c</arg>
            <arg>ansible --sudo --inventory-file=testing #{target} -m apt -a "name=#{package}=$(&lt; ../../version) state=present update_cache=yes force=yes"</arg>
          </exec>
        </tasks>
      </job>

Bash's $(...) opens a sub-process (which again is a bash instance), and inserts the output from that sub-process into the command line. < ../../version is a short way of reading the file. And, this being XML, the less-than sign needs to be escaped.

The production deployment configuration looks pretty much the same, just with --inventory-file=production.

Try it!

To test the version-specific package installation, you need to have at least two runs of the pipeline that captured the version artifact. If you don't have that yet, you can push commits to the source repository, and GoCD picks them up automatically.

You can query the installed version on the target machine with dpkg -l package-info. After the last run, the version built in that pipeline instance should be installed.
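
If you only want the version string, dpkg-query is a bit more script-friendly; the version shown here is of course just an example:

$ dpkg-query --show --showformat='${Version}\n' package-info
0.1-3.2.1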

Then you can rerun the deployment stage from a previous pipeline, for example in the history view of the pipeline by hovering with the mouse over the stage, and then clicking on the circle with the arrow on it that triggers the rerun.

After the stage rerun has completed, checking the installed version again should yield the version built in the pipeline instance that you selected.

Conclusions

Deploying exactly the version that was built in the same pipeline instance is fairly easy to implement once you know how to set up your pipeline for it.

Once you've done that, you can easily deploy older versions of your software as a rollback, and use the same mechanism to automatically build and deploy hotfixes.


I'm writing a book on automating deployments. If this topic interests you, please sign up for the Automating Deployments newsletter. It will keep you informed about automating and continuous deployments. It also helps me to gauge interest in this project, and your feedback can shape the course it takes.


Perlgeek.de : Automating Deployments and Configuration Management

New software versions often need new configuration as well. How do you make sure that the necessary configuration arrives on a target machine at the same time (or before) the software release that introduces them?

The obvious approach is to put the configuration in version control too and deploy it alongside the software.

Taking configuration from a source repository and applying it to running machines is what configuration management software does.

Since Ansible has been used for deployment in the examples so far -- and it's a good configuration management system as well -- it is an obvious choice to use here.

Benefits of Configuration Management

When your infrastructure scales to many machines, and you don't want your time and effort to scale linearly with them, you need to automate things. Keeping configuration consistent across different machines is a requirement, and configuration management software helps you achieve that.

Furthermore, once the configuration comes from a canonical source with version control, tracking and rolling back configuration changes becomes trivial. If there is an outage, you don't need to ask all potentially relevant colleagues whether they changed anything -- your version control system can easily tell you. And if you suspect that a recent change caused the outage, reverting it to see if the revert works is a matter of seconds or minutes.

Once configuration and deployment are automated, building new environments, for example for penetration testing, becomes a much more manageable task.

Capabilities of a Configuration Management System

Typical tasks and capabilities of configuration management software include things like connecting to the remote host, copying files to the host (and often adjusting parameters and filling out templates in the process), ensuring that operating system packages are installed or absent, creating users and groups, controlling services, and even executing arbitrary commands on the remote host.

With Ansible, the connection to the remote host is provided by the core, and the actual steps to be executed are provided by modules. For example the apt_repository module can be used to manage repository configuration (i.e. files in /etc/apt/sources.list.d/), the apt module installs, upgrades, downgrades or removes packages, and the template module typically generates configuration files from variables that the user defined, and from facts that Ansible itself gathered.

There are also higher-level Ansible modules available, for example for managing Docker images, or load balancers from the Amazon cloud.

A complete introduction to Ansible is out of scope here, but I can recommend the online documentation, as well as the excellent book Ansible: Up and Running by Lorin Hochstein.

To get a feeling for what you can do with Ansible, see the ansible-examples git repository.

Assuming that you will find your way around configuration management with Ansible through other resources, I want to talk about how you can integrate it into the deployment pipeline instead.

Integrating Configuration Management with Continuous Delivery

The previous approach of writing one deployment playbook for each application can serve as a starting point for configuration management. You can simply add more tasks to the playbook, for example for creating the configuration files that the application needs. Then each deployment automatically ensures the correct configuration.

Since most modules in Ansible are idempotent, that is, repeated execution doesn't change the state of the system after the first time, adding additional tasks to the deployment playbook only becomes problematic when performance suffers. If that happens, you could start to extract some slow steps out into a separate playbook that doesn't run on each deployment.

If you provision and configure a new machine, you typically don't want to manually trigger the deploy step of each application, but rather have a single command that deploys and configures all of the relevant applications for that machine. So it makes sense to also have a playbook for deploying all relevant applications. This can be as simple as a list of include statements that pull in the individual application's playbooks.

You can add another pipeline that applies this "global" configuration to the testing environment, and after manual approval, in the production environment as well.

Stampedes and Synchronization

In the scenario outlined above, the configuration for all related applications lives in the same git repository, and is used as a material in the build and deployment pipelines for all these applications.

A new commit in the configuration repository then triggers a rebuild of all the applications. For a small number of applications, that's usually not a problem, but if you have a dozen or a few dozen applications, this starts to suck up resources unnecessarily, and also means no build workers are available for some time to build changes triggered by actual code changes.

To avoid these build stampedes, a pragmatic approach is to use ignore filters in the git materials. Ignore filters are typically used to avoid rebuilds when only documentation changes, but they can also be used to prevent any changes in a repository from triggering a rebuild.

If, in the <materials> section of your GoCD pipeline, you replace

<git url="https://github.com/moritz/deployment-utils.git" dest="deployment-utils" materialName="deployment-utils" />

with

<git url="https://github.com/moritz/deployment-utils.git" dest="deployment-utils" materialName="deployment-utils">
  <filter>
    <ignore pattern="**/*" />
    <ignore pattern="*" />
  </filter>
</git>

then a newly pushed commit to the deployment-utils repo won't trigger this pipeline. A new build, triggered either manually or from a new commit in the application's git repository, still picks up the newest version of the deployment-utils repository.

In the pipeline that deploys all of the configuration, you wouldn't add such a filter.

Now if you change some playbooks, the pipeline for the global configuration runs and rolls out these changes, and you promote the newest version to production. When you then deploy to production an application whose last build happened before the changes to the playbook, that deployment actually uses an older version of the playbook.

This sounds like a very unfortunate situation, but it turns out not to be so bad: the combination of playbook version and application version worked in testing, so it should work in production as well.

To avoid using an older playbook, you can trigger a rebuild of the application, which automatically uses the newest playbook version.

Finally, in practice it is a good idea to bring most changes to production pretty quickly anyway. If you don't do that, you lose overview of what changed, which leads to growing uncertainty about whether a production release is safe. If you follow this ideal of going quickly to production, the version mismatches between the configuration and application pipelines should never become big enough to worry about.

Conclusion

The deployment playbooks that you write for your applications can be extended to do full configuration management for these applications. You can create a "global" Ansible playbook that includes those deployment playbooks, and possibly other configuration, such as basic configuration of the system.


I'm writing a book on automating deployments. If this topic interests you, please sign up for the Automating Deployments newsletter. It will keep you informed about automating and continuous deployments. It also helps me to gauge interest in this project, and your feedback can shape the course it takes.


Perlgeek.de : Automating Deployments: Version Recycling Considered Harmful

In the previous installment we saw a GoCD configuration that automatically built a Debian package from a git repository whenever somebody pushes a new commit to the git repo.

The version of the generated Debian package comes from the debian/changelog file of the git repository. Which means that whenever somebody pushes code or doc changes without a new changelog entry, the resulting Debian package has the same version number as the previous one.

The problem with this version recycling is that most Debian tooling assumes that the tuple of package name, version and architecture uniquely identifies a revision of a package. So stuffing a new version of a package with an old version number into a repository is bound to cause trouble; most repository management software simply refuses to accept that. On the target machine, upgrading the package won't do anything if the version number stays the same.

So it's a good idea to put a bit more thought into the version string of the automatically built Debian package.

Constructing Unique Version Numbers

There are several sources that you can tap to generate unique version numbers:

  • Randomness (for example in the form of UUIDs)
  • The current date and time
  • The git repository itself
  • GoCD exposes several environment variables that can be of use

The latter is quite promising: GO_PIPELINE_COUNTER is a monotonic counter that increases each time GoCD runs the pipeline, so a good source for a version number. GoCD allows manual re-running of stages, so it's best to combine it with GO_STAGE_COUNTER. In terms of shell scripting, using $GO_PIPELINE_COUNTER.$GO_STAGE_COUNTER as a version string sounds like a decent approach.

But, there's more. GoCD allows you to trigger a pipeline with a specific version of a material, so you can have a new pipeline run to build an old version of the software. If you do that, using GO_PIPELINE_COUNTER as the first part of the version string doesn't reflect the use of an old code base.

To construct a version string that primarily reflects the version of the git repository, and only secondarily the build iteration, the first part of the version string has to come from git. As a distributed version control system, git doesn't supply a single, numeric version counter. But if you limit yourself to a single repository and branch, you can simply count commits.

git describe is an established way to count commits. By default it prints the last tag in the repo, and if HEAD does not resolve to the same commit as the tag, it adds the number of commits since that tag, and the abbreviated sha1 hash prefixed by g, so for example 2016.04-32-g4232204 for the commit 4232204, which is 32 commits after the tag 2016.04. The option --long forces it to always print the number of commits and the hash, even when HEAD points to a tag.

We don't need the commit hash for the version number, so a shell script to construct a good version number looks like this:

#!/bin/bash

set -e
set -o pipefail
version=$(git describe --long |sed 's/-g[A-Fa-f0-9]*$//')
version="$version.${GO_PIPELINE_COUNTER:-0}.${GO_STAGE_COUNTER:-0}"

Bash's ${VARIABLE:-default} syntax is a good way to make the script work outside a GoCD agent environment.
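
For illustration, with the example tag and commit count from above, and hypothetical counter values GO_PIPELINE_COUNTER=7 and GO_STAGE_COUNTER=1, the intermediate steps and the final version string would be:

$ git describe --long
2016.04-32-g4232204
$ git describe --long | sed 's/-g[A-Fa-f0-9]*$//'
2016.04-32
# with the counters appended: version=2016.04-32.7.1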

This script requires a tag to be set in the git repository. If there is none, it fails with this message from git describe:

fatal: No names found, cannot describe anything.

Other Bits and Pieces Around the Build

Now that we have a version string, we need to instruct the build system to use this version string. This works by writing a new entry in debian/changelog with the desired version number. The debchange tool automates this for us. A few options are necessary to make it work reliably:

export DEBFULLNAME='Go Debian Build Agent'
export DEBEMAIL='go-noreply@example.com'
debchange --newversion=$version  --force-distribution -b  \
    --distribution="${DISTRIBUTION:-jessie}" 'New Version'

When we want to reference this version number in later stages of the pipeline (yes, there will be more), it's handy to have it available in a file. It is also handy to have it in the output, so we add two more lines to the script:

echo $version
echo $version > ../version

And of course, trigger the actual build:

debuild -b -us -uc

Plugging It Into GoCD

To make the script accessible to GoCD, and also have it under version control, I put it into a git repository under the name debian-autobuild and added the repo as a material to the pipeline:

<pipeline name="package-info">
  <materials>
    <git url="https://github.com/moritz/package-info.git" dest="package-info" />
    <git url="https://github.com/moritz/deployment-utils.git" dest="deployment-utils" materialName="deployment-utils" />
  </materials>
  <stage name="build" cleanWorkingDir="true">
    <jobs>
      <job name="build-deb" timeout="5">
        <tasks>
          <exec command="../deployment-utils/debian-autobuild" workingdir="#{package}" />
        </tasks>
        <artifacts>
          <artifact src="version" />
          <artifact src="package-info*_*" dest="package-info/" />
        </artifacts>
      </job>
    </jobs>
  </stage>
</pipeline>

Now GoCD automatically builds Debian packages on each commit to the git repository, and gives each a distinct version string.

The next step is to add it to a repository, so that it can be installed on a target machine with a simple apt-get command.


I'm writing a book on automating deployments. If this topic interests you, please sign up for the Automating Deployments newsletter. It will keep you informed about automating and continuous deployments. It also helps me to gauge interest in this project, and your feedback can shape the course it takes.


Dave's Free Press: Journal: CPAN Testers' CPAN author FAQ

Ocean of Awareness: What parser do birds use?

"Here we provide, to our knowledge, the first unambiguous experimental evidence for compositional syntax in a non-human vocal system." -- "Experimental evidence for compositional syntax in bird calls", Toshitaka N. Suzuki, David Wheatcroft & Michael Griesser Nature Communications 7, Article number: 10986

In this post I look at a subset of the language of the Japanese great tit, also known as Parus major. The above cited article presents evidence that bird brains can parse this language. What about standard modern computer parsing methods? Here is the subset -- probably a tiny one -- of the language actually used by Parus major.

      S ::= ABC
      S ::= D
      S ::= ABC D
      S ::= D ABC
    

Classifying the Parus major grammar

Grammophone is a very handy new tool for classifying grammars. Its own parser is somewhat limited, so that it requires a period to mark the end of a rule. The above grammar is in Marpa's SLIF format, which is smart enough to use the "::=" operator to spot the beginning and end of rules, just as the human eye does. Here's the same grammar converted into a form acceptable to Grammophone:

      S -> ABC .
      S -> D .
      S -> ABC D .
      S -> D ABC .
    

Grammophone tells us that the Parus major grammar is not LL(1), but that it is LALR(1).

What does this mean?

LL(1) is the class of grammar parseable by top-down methods: it's the best class for characterizing most parsers in current use, including recursive descent, PEG, and Perl 6 grammars. All of these parsers fall short of dealing with the Parus major language.

LALR(1) is probably most well-known from its implementations in bison and yacc. While able to handle this subset of Parus's language, LALR(1) has its own, very strict, limits. Whether LALR(1) could handle the full complexity of Parus language is a serious question. But it's a question that in practice would probably not arise. LALR(1) has horrible error handling properties.

When the input is correct and within its limits, an LALR-driven parser is fast and works well. But if the input is not perfectly correct, LALR parsers produce no useful analysis of what went wrong. If Parus hears "d abc d", a parser like Marpa, on the other hand, can produce something like this:

# * String before error: abc d\s
# * The error was at line 1, column 7, and at character 0x0064 'd', ...
# * here: d
    

Parus uses its language in predatory contexts, and one can assume that a Parus with a preference for parsers whose error handling is on an LALR(1) level will not be keeping its alleles in the gene pool for very long.

References, comments, etc.

Those readers content with sub-Parus parsing methods may stop reading here. Those with greater parsing ambitions, however, may wish to learn more about Marpa. A Marpa test script for parsing the Parus subset is in a Github gist. Marpa has a semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: Thankyou, Anonymous Benefactor!

Dave's Free Press: Journal: Number::Phone release

Dave's Free Press: Journal: Ill

Dave's Free Press: Journal: CPANdeps upgrade

Perlgeek.de : Automating Deployments: Smoke Testing and Rolling Upgrades

In the last installment I talked about unit testing that covers the logic of your application. Unit testing is a good and efficient way to ensure the quality of the business logic, however unit tests tend to test components in isolation.

You should also check that several components work together well, which can be done with integration tests or smoke tests. The distinction between these two is a bit murky at times, but typically integration tests are still done somewhat in isolation, whereas smoke tests are run against an installed copy of the software in a complete environment, with all external services available.

A smoke test thus goes through the whole software stack. For a web application, that typically entails a web server, an application server, a database, and possibly integration points with other services such as single sign-on (SSO) or external data sources.

When to Smoke?

Smoke tests cover a lot of ground at once. A single test might require a working network, correctly configured firewall, web server, application server, database, and so on to work. This is an advantage, because it means that it can detect a big class of errors, but it is also a disadvantage, because it means the diagnostic capabilities are low. When it fails, you don't know which component is to blame, and have to investigate each failure anew.

Smoke tests are also much more expensive than unit tests; they tend to take more time to write, take longer to execute, and are more fragile in the face of configuration or data changes.

So typical advice is to have a low number of smoke tests, maybe one to 20, or maybe around one percent of the unit tests you have.

As an example, if you were to develop a flight search and recommendation engine for the web, your unit tests would cover different scenarios that the user might encounter, and check that the engine produces the best possible suggestions. In smoke tests, you would just check that you can enter the starting point, destination and date of travel, and that you get a list of flight suggestions at all. If there is a membership area on that website, you would test that you cannot access it without credentials, and that you can access it after logging in. So, three smoke tests, give or take.

White Box Smoke Testing

The examples mentioned above are basically black-box smoke testing, in that they don't care about the internals of the application, and approach the application just like a user. This is very valuable, because ultimately you care about your user's experience.

But sometimes some aspects of the application aren't easy to smoke test, yet break often enough to warrant automated smoke tests. A practical solution is to offer some kind of self diagnosis, for example a web page where the application tests its own configuration for consistency, checks that all the necessary database tables exist, and that external services are reachable.

Then a single smoke test can call the status page, and throw an error whenever either the status page is not reachable, or reports an error. This is a white box smoke test.

Status pages for white box smoke tests can be reused in monitoring checks, but it is still a good idea to explicitly check the status page as part of the deployment process.

White box smoke testing should not replace black box smoke testing, but rather complement it.

An Example Smoke Test

The matheval application from the previous blog post offers a simple HTTP endpoint, so any HTTP client will do for smoke testing.

Using the curl command line HTTP client, a possible request looks like this:

$ curl  --silent -H "Accept: application/json" --data '["+", 37, 5]' -XPOST  http://127.0.0.1:8800/
42

An easy way to check that the output matches expectations is by piping it through grep:

$ curl  --silent -H "Accept: application/json" --data '["+", 37, 5]' -XPOST  http://127.0.0.1:8800/ | grep ^42$
42

The output is the same as before, but the exit status is non-zero if the output deviates from the expectation.
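
You can see that by inspecting the exit status directly, here with a deliberately wrong expected value:

$ curl  --silent -H "Accept: application/json" --data '["+", 37, 5]' -XPOST  http://127.0.0.1:8800/ | grep ^43$
$ echo $?
1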

Integrating the Smoke Tests Into the Pipeline

One could add a smoke test stage after each deployment stage (that is, one after the test deployment, one after the production deployment).

This setup would prevent a version of your application from reaching the production environment if it failed smoke tests in the testing environment. Since the smoke test is just a shell command that indicates failure with a non-zero exit status, adding it as a command in your deployment system should be trivial.

If you have just one instance of your application running, this is the best you can do. But if you have a farm of servers, and several instances of the application running behind some kind of load balancer, it is possible to smoke test each instance separately during an upgrade, and abort the upgrade if too many instances fail the smoke test.

All big, successful tech companies protect their production systems with such check-guarded partial upgrades, or even more elaborate versions thereof.

A simple approach to such a rolling upgrade is to write an ansible playbook for the deployment of each package, and have it run the smoke tests for each machine before moving to the next:

# file smoke-tests/python-matheval
#!/bin/bash
curl  --silent -H "Accept: application/json" --data '["+", 37, 5]' -XPOST  http://$1:8800/ | grep ^42$


# file ansible/deploy-python-matheval.yml
---
- hosts: web
  serial: 1
  max_fail_percentage: 1
  tasks:
    - apt: update_cache=yes package=python-matheval={{package_version}} state=present force=yes
    - local_action: command ../smoke-tests/python-matheval "{{ansible_host}}"
      changed_when: False

As the smoke tests grow over time, it is not practical to cram them all into the ansible playbook, and doing that would also limit reusability. Instead, here they live in a separate file in the deployment-utils repository. Another option would be to build a package from the smoke tests and install it on the machine that ansible runs on.

While it would be easy to execute the smoke test command on the machine on which the service is installed, running it as a local action (that is, on the control host where the ansible playbook is started) also tests the network and firewall part, and thus mimics the actual usage scenario more realistically.

GoCD Configuration

To run the new deployment playbook from within the GoCD pipeline, change the testing deployment job in the template to:

        <tasks>
          <fetchartifact pipeline="" stage="build" job="build-deb" srcfile="version" />
          <exec command="/bin/bash" workingdir="deployment-utils/ansible/">
            <arg>-c</arg>
            <arg>ansible-playbook --inventory-file=testing --extra-vars=package_version=$(&lt; ../../version) #{deploy_playbook}</arg>
          </exec>
        </tasks>

And the same for production, except that it uses the production inventory file. This change to the template also changes the parameters that need to be defined in the pipeline definition. In the python-matheval example it becomes

  <params>
    <param name="distribution">jessie</param>
    <param name="package">python-matheval</param>
    <param name="deploy_playbook">deploy-python-matheval.yml</param>
  </params>

Since there are two pipelines that share the same template, the second pipeline (for the package package-info) also needs a deployment playbook. It is very similar to the one for python-matheval; it just lacks the smoke test for now.

Conclusion

Writing a small amount of smoke tests is very beneficial for the stability of your applications.

Rolling updates with integrated smoke tests for each system involved are pretty easy to do with ansible, and can be integrated into the GoCD pipeline with little effort. They mitigate the damage of deploying a bad version or a bad configuration by limiting it to one system, or a small number of systems in a bigger cluster.

With this addition, the deployment pipeline is likely to be at least as robust as most manual deployment processes, but it requires much less effort, is easier to scale to more packages, and gives more insight into the timeline of deployments and installed versions.


I'm writing a book on automating deployments. If this topic interests you, please sign up for the Automating Deployments newsletter. It will keep you informed about automating and continuous deployments. It also helps me to gauge interest in this project, and your feedback can shape the course it takes.


Perlgeek.de : Story Time: Rollbacks Saved the Day

At work we develop, among other things, an internally used web application based on AngularJS. Last Friday, we received a rather urgent bug report that in a co-worker's browser, a rather important page wouldn't load at all, and show three empty error notifications.

Only one of our two frontend developers was present, and she didn't immediately know what was wrong or how to fix it. And what's worse, she couldn't reproduce the problem in her own browser.

A quick look into our GoCD instance showed that the previous production deployment of this web application had happened on the previous day, and we had no reports of similar errors prior to that, so we decided to roll back to the previous version.

I recently blogged about deploying specific versions, which allows you to do rollbacks easily. And in fact I had implemented that technique just two weeks before this incident. To make the rollback happen, I just had to click on the "rerun" button of the installation stage of the last good deployment, and wait for a short while (about a minute).

Since I had a browser where I could reproduce the problem, I could verify that the rollback had indeed solved the problem, and our co-worker was able to continue his work. This took the stress off the frontend developers, who didn't have to fix the bug in a hurry. Instead, they just had to fix it before the next production deployment, which in this case meant it was possible to wait until the next working day, when the second developer was present again.

For me, this episode was a good validation of the rollback mechanism in the deployment pipeline.

Postskriptum: The Bug

In case you wonder, the actual bug that triggered the incident was related to caching. The web application just introduced caching of data in the browser's local storage. In Firefox on Debian Jessie and some Mint versions, writing to the local storage raised an exception, which the web application failed to catch. Firefox on Ubuntu and Mac OS X didn't produce the same problem.

Curiously, Firefox was configured to allow the usage of local storage for this domain, the default quota of around 5MB was unchanged, and both the existing local storage and the new data to be written were in the low kilobyte range. A website that experimentally determines the local storage size confirmed it to be 5200 kB. So I suspect that there is a Firefox bug involved on these platforms as well.


I'm writing a book on automating deployments. If this topic interests you, please sign up for the Automating Deployments newsletter. It will keep you informed about automating and continuous deployments. It also helps me to gauge interest in this project, and your feedback can shape the course it takes.


Ocean of Awareness: Introduction to Marpa Book in progress

What follows is a summary of the features of the Marpa algorithm, followed by a discussion of potential applications. It refers to itself as a "monograph", because it is a draft of part of the introduction to a technical monograph on the Marpa algorithm. I hope the entire monograph will appear in a few weeks.

The Marpa project

The Marpa project was intended to create a practical and highly available tool to generate and use general context-free parsers. Tools of this kind had long existed for LALR and regular expressions. But, despite an encouraging academic literature, no such tool had existed for context-free parsing. The first stable version of Marpa was uploaded to a public archive on Solstice Day 2011. This monograph describes the algorithm used in the most recent version of Marpa, Marpa::R2. It is a simplification of the algorithm presented in my earlier paper.

A proven algorithm

While the presentation in this monograph is theoretical, the approach is practical. The Marpa::R2 implementation has been widely available for some time, and has seen considerable use, including in production environments. Many of the ideas in the parsing literature satisfy theoretical criteria, but in practice turn out to face significant obstacles. An algorithm may be as fast as reported, but may turn out not to allow adequate error reporting. Or a modification may speed up the recognizer, but require additional processing at evaluation time, leaving no advantage to compensate for the additional complexity.

In this monograph, I describe the Marpa algorithm as it was implemented for Marpa::R2. In many cases, I believe there are better approaches than those I have described. But I treat these techniques, however solid their theory, as conjectures. Whenever I mention a technique that was not actually implemented in Marpa::R2, I will always explicitly state that that technique is not in Marpa as implemented.

Features

General context-free parsing

As implemented, Marpa parses all "proper" context-free grammars. The proper context-free grammars are those which are free of cycles, unproductive symbols, and inaccessible symbols. Worst case time bounds are never worse than those of Earley's algorithm, and therefore never worse than O(n**3).

Linear time for practical grammars

Currently, the grammars suitable for practical use are thought to be a subset of the deterministic context-free grammars. Using a technique discovered by Joop Leo, Marpa parses all of these in linear time. Leo's modification of Earley's algorithm is O(n) for LR-regular grammars. Leo's modification also parses many ambiguous grammars in linear time.

Left-eidetic

The original Earley algorithm kept full information about the parse --- including partial and fully recognized rule instances --- in its tables. At every parse location, before any symbols are scanned, Marpa's parse engine makes available its information about the state of the parse so far. This information is in useful form, and can be accessed efficiently.

Recoverable from read errors

When Marpa reads a token which it cannot accept, the error is fully recoverable. An application can try to read another token. The application can do this repeatedly as long as none of the tokens are accepted. Once the application provides a token that is accepted by the parser, parsing will continue as if the unsuccessful read attempts had never been made.

Ambiguous tokens

Marpa allows ambiguous tokens. These are often useful in natural language processing where, for example, the same word might be a verb or a noun. Use of ambiguous tokens can be combined with recovery from rejected tokens so that, for example, an application could react to the rejection of a token by reading two others.

Using the features

Error reporting

An obvious application of left-eideticism is error reporting. Marpa's abilities in this respect are ground-breaking. For example, users typically regard an ambiguity as an error in the grammar. Marpa, as currently implemented, can detect an ambiguity and report specifically where it occurred and what the alternatives were.

Event driven parsing

As implemented, Marpa::R2 allows the user to define "events". Events can be defined that trigger when a specified rule is complete, when a specified rule is predicted, when a specified symbol is nulled, when a user-specified lexeme has been scanned, or when a user-specified lexeme is about to be scanned. A mid-rule event can be defined by adding a nulling symbol at the desired point in the rule, and defining an event which triggers when the symbol is nulled.

Ruby slippers parsing

Left-eideticism, efficient error recovery, and the event mechanism can be combined to allow the application to change the input in response to feedback from the parser. In traditional parser practice, error detection is an act of desperation. In contrast, Marpa's error detection is so painless that it can be used as the foundation of new parsing techniques.

For example, if a token is rejected, the lexer is free to create a new token in the light of the parser's expectations. This approach can be seen as making the parser's "wishes" come true, and I have called it "Ruby Slippers Parsing".

One use of the Ruby Slippers technique is to parse with a clean but oversimplified grammar, programming the lexical analyzer to make up for the grammar's short-comings on the fly. As part of Marpa::R2, the author has implemented an HTML parser, based on a grammar that assumes that all start and end tags are present. Such an HTML grammar is too simple even to describe perfectly standard-conformant HTML, but the lexical analyzer is programmed to supply start and end tags as requested by the parser. The result is a simple and cleanly designed parser that parses very liberal HTML and accepts all input files, in the worst case treating them as highly defective HTML.

Ambiguity as a language design technique

In current practice, ambiguity is avoided in language design. This is very different from the practice in the languages humans choose when communicating with each other. Human languages exploit ambiguity in order to design highly flexible, powerfully expressive languages. For example, the language of this monograph, English, is notoriously ambiguous.

Ambiguity of course can present a problem. A sentence in an ambiguous language may have undesired meanings. But note that this is not a reason to ban potential ambiguity --- it is only a problem with actual ambiguity.

Syntax errors, for example, are undesired, but nobody tries to design languages to make syntax errors impossible. A language in which every input was well-formed and meaningful would be cumbersome and even dangerous: all typos in such a language would be meaningful, and the parser would never warn the user about errors, because there would be no such thing.

With Marpa, ambiguity can be dealt with in the same way that syntax errors are dealt with in current practice. The language can be designed to be ambiguous, but any actual ambiguity can be detected and reported at parse time. This exploits Marpa's ability to report exactly where and what the ambiguity is. Marpa::R2's own parser description language, the SLIF, uses ambiguity in this way.

Auto-generated languages

In 1973, Čulik and Cohen pointed out that the ability to efficiently parse LR-regular languages opens the way to auto-generated languages. In particular, Čulik and Cohen note that a parser which can parse any LR-regular language will be able to parse a language generated using syntax macros.

Second order languages

In the literature, the term "second order language" is usually used to describe languages with features which are useful for second-order programming. True second-order languages --- languages which are auto-generated from other languages --- have not been seen as practical, since there was no guarantee that the auto-generated language could be efficiently parsed.

With Marpa, this barrier is lifted. As an example, Marpa::R2's own parser description language, the SLIF, allows "precedenced rules". Precedenced rules are specified in an extended BNF. The BNF extensions allow precedence and associativity to be specified for each RHS.

Marpa::R2's precedenced rules are implemented as a true second order language. The SLIF representation of the precedenced rule is parsed to create a BNF grammar which is equivalent, and which has the desired precedence. Essentially, the SLIF does a standard textbook transformation. The transformation starts with a set of rules, each of which has a precedence and an associativity specified. The result of the transformation is a set of rules in pure BNF. The SLIF's advantage is that it is powered by Marpa, and therefore the SLIF can be certain that the grammar that it auto-generates will parse in linear time.

Notationally, Marpa's precedenced rules are an improvement over similar features in LALR-based parser generators like yacc or bison. In the SLIF, there are two important differences. First, in the SLIF's precedenced rules, precedence is generalized, so that it does not depend on the operators: there is no need to identify operators, much less class them as binary, unary, etc. This more powerful and flexible precedence notation allows the definition of multiple ternary operators, and multiple operators with arity above three.

Second, and more important, a SLIF user is guaranteed to get exactly the language that the precedenced rule specifies. The user of the yacc equivalent must hope their syntax falls within the limits of LALR.

References, comments, etc.

Marpa has a semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: YAPC::Europe 2006 report: day 3

Perlgeek.de : Automated Deployments: Unit Testing

Automated testing is absolutely essential for automated deployments. When you automate deployments, you automatically do them more often than before, which means that manual testing becomes more effort, more annoying, and is usually skipped sooner or later.

So to maintain a high degree of confidence that a deployment won't break the application, automated tests are the way to go.

And yet, I've written twenty blog posts about automating deployments, and this is the first about testing. Why did I drag my feet like this?

For one, testing is hard to generalize. But more importantly, the example project used so far doesn't play well with my usual approach to testing.

Of course one can still test it, but it's not an idiomatic approach that scales to real applications.

The easy way out is to consider a second example project. This also provides a good excuse to test the GoCD configuration template, and explore another way to build Debian packages.

Meet python-matheval

python-matheval is a stupid little web service that accepts a tree of mathematical expressions encoded in JSON format, evaluates it, and returns the result in the response. And as the name implies, it's written in python. Python3, to be precise.

The actual evaluation logic is quite compact:

# file src/matheval/evaluator.py
from functools import reduce
import operator

ops = {
    '+': operator.add,
    '-': operator.sub,
    '*': operator.mul,
    '/': operator.truediv,
}

def math_eval(tree):
    if not isinstance(tree, list):
        return tree
    op = ops[tree.pop(0)]
    return reduce(op, map(math_eval, tree))

Exposing it to the web isn't much effort either, using the Flask library:

# file src/matheval/frontend.py
#!/usr/bin/python3

from flask import Flask, request

from matheval.evaluator import math_eval

app = Flask(__name__)

@app.route('/', methods=['GET', 'POST'])
def index():
    tree = request.get_json(force=True)
    result = math_eval(tree)
    return str(result) + "\n"

if __name__ == '__main__':
    app.run(debug=True)
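
To try it out locally, you can start the Flask development server (which listens on port 5000 by default) and send it a request; the PYTHONPATH setting is just one way to make the matheval package importable from the source checkout:

$ PYTHONPATH=src python3 src/matheval/frontend.py &
$ curl --data '["*", 6, 7]' -XPOST http://127.0.0.1:5000/
42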

The rest of the code is part of the build system. As a python package, it should have a setup.py in the root directory

# file setup.py
#!/usr/bin/env python

from setuptools import setup

setup(name='matheval',
      version='1.0',
      description='Evaluation of expression trees',
      author='Moritz Lenz',
      author_email='moritz.lenz@gmail.com',
      url='https://deploybook.com/',
      package_dir={'': 'src'},
      requires=['flask', 'gunicorn'],
      packages=['matheval']
     )

Once a working setup script is in place, the tool dh-virtualenv can be used to create a Debian package containing the project itself and all of its Python-level dependencies.

This creates rather large Debian packages (in this case, around 4 MB for less than a kilobyte of actual application code), but on the upside it allows several applications on the same machine to depend on different versions of the same Python library. The simple usage of the resulting Debian packages makes them well worth it in many use cases.

Using dh-virtualenv is quite easy:

# file debian/rules
#!/usr/bin/make -f
export DH_VIRTUALENV_INSTALL_ROOT=/usr/share/python-custom

%:
    dh $@ --with python-virtualenv --with systemd

override_dh_virtualenv:
    dh_virtualenv --python=/usr/bin/python3

See the github repository for all the other boring details, like the systemd service files and the control file.

The integration into the GoCD pipeline is easy, using the previously developed configuration template:

<pipeline name="python-matheval" template="debian-base">
  <params>
    <param name="distribution">jessie</param>
    <param name="package">python-matheval</param>
    <param name="target">web</param>
  </params>
  <materials>
    <git url="https://github.com/moritz/python-matheval.git" dest="python-matheval" materialName="python-matheval" />
    <git url="https://github.com/moritz/deployment-utils.git" dest="deployment-utils" materialName="deployment-utils" />
  </materials>
</pipeline>

Getting Started with Testing, Finally

It is good practice, and generally a good idea, to cover business logic with unit tests.

Since the evaluation logic is split off into a separate function, that function is easy to test in isolation. A typical approach is to feed some example inputs into the function and check that the return values are as expected:

# file test/test-evaluator.py
import unittest
from matheval.evaluator import math_eval

class EvaluatorTest(unittest.TestCase):
    def _check(self, tree, expected):
        self.assertEqual(math_eval(tree), expected)

    def test_basic(self):
        self._check(5, 5)
        self._check(['+', 5], 5)
        self._check(['+', 5, 7], 12)
        self._check(['*', ['+', 5, 4], 2], 18)

if __name__ == '__main__':
    unittest.main()

One can execute the test suite (here just one test file so far) with the nosetests command from the nose Python package:

$ nosetests
.
----------------------------------------------------------------------
Ran 1 test in 0.004s

OK

The Python way of exposing the test suite is to implement the test command in setup.py, which can be done with the line

test_suite='nose.collector',

in the setup() call in setup.py. And of course one needs to add nose to the list passed to the requires argument.
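
Putting the two changes together, the setup() call from above might end up looking roughly like this (a sketch, not a verbatim copy of the repository):

# file setup.py (sketch with the test-related additions)
from setuptools import setup

setup(name='matheval',
      version='1.0',
      description='Evaluation of expression trees',
      author='Moritz Lenz',
      author_email='moritz.lenz@gmail.com',
      url='https://deploybook.com/',
      package_dir={'': 'src'},
      requires=['flask', 'gunicorn', 'nose'],
      packages=['matheval'],
      test_suite='nose.collector',
     )

Running python3 setup.py test should then execute the same tests that nosetests finds.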

With these measures in place, the debhelper and dh-virtualenv tooling takes care of executing the test suite as part of the Debian package build. If any of the tests fail, so does the build.

Running the test suite in this way is advantageous, because it runs the tests with exactly the same versions of all involved Python libraries that end up in the Debian package, and thus make up the runtime environment of the application. It is possible to achieve this through other means, but other approaches usually take much more work.

Conclusions

You should have enough unit tests to make you confident that the core logic of your application works correctly. It is a very easy and pragmatic solution to run the unit tests as part of the package build, ensuring that only "good" versions of your software are ever packaged and installed.

In future blog posts, other forms of testing will be explored.


I'm writing a book on automating deployments. If this topic interests you, please sign up for the Automating Deployments newsletter. It will keep you informed about automating and continuous deployments. It also helps me to gauge interest in this project, and your feedback can shape the course it takes.
