Digging Deeper into Autoconf (AT/2)

I left off last time with a brief overview of autoconf. This time I’m going to dig a bit deeper. I’ll show you how autoconf works from the inside out, and give you a few insights on where to look when you can’t find the answer in the documentation – or even in a Google search.

Transforming Text

Autoconf is a text stream transforming tool. The input is your configure.ac file, some autoconf input files, and several autoconf macro definition files. The output is your configure script. The transforming tool is simply the m4 macro processor. To be clear, autoconf does little more than call m4 on your configure.ac file (sometimes called configure.in – an older, deprecated naming convention for autoconf input files) to generate your configure script. This fact is almost worth repeating, because it’s the root cause of most people’s frustration with autoconf. Thus, a solid understanding of m4 syntax and semantics is key to understanding why certain issues crop up in configure.ac files.

An Autoconf-Oriented M4 Tutorial

Let’s begin with a short tutorial on m4 – I mean long enough to be useful, but short enough to be interesting. I’ll tell you just what you need to know to use autoconf well. Check out the GNU m4 manual for a much more in-depth treatise on the subject.

The fundamental job of a macro processor is text replacement. Sounds simple, but all sorts of problems crop up when the text you’re trying to replace contains the meta-characters or strings that are supposed to delineate text you’re trying to replace.

There are two basic approaches to this problem. The first is to choose meta-characters or delineation strings that no one would ever consider using in their document. The second is to define some sort of escaping mechanism that will allow authors to indicate to the processor which meta-characters or delineation strings should be taken literally, and which should be considered by the processor as such. This is called quoting or escaping. Since m4 is a general purpose macro processor, it’s difficult to conceive of a set of delimiter characters or words that would be considered unnecessary within the scope of all possible input languages. However, m4 uses both of these mechanisms to some degree. How this is so will become clear in a few paragraphs.

m4 parses a stream of input text into a series of tokens. Each token is interpreted in one of three different ways: as a name, as a quoted string, or as any other character that is not a name or a quoted string. In addition, m4 recognizes a secondary form of quoting specifically designed for comments.

Names are sequences of alpha-numeric characters (letters, digits and underscores), where the first character is not a digit. A quoted string is any sequence of text within m4 quote delimiters, which by default are the characters ` and ‘. Comments are sequences of characters delimited by the comment delimiters, which by default are the characters # and CR (the carriage-return, new-line or end-of-line delimiter).

All other characters in the input stream are treated individually as separate tokens. I emphasize this statement because understanding this concept is very important. It means that white space characters are each individual tokens. Unlike a C language lexical analyzer, which simply treats a run of white space characters as a token separator, m4 treats two consecutive SPACE characters as two separate individual tokens. I’ll wager that most of your configure.ac parsing and expansion problems will disappear when you fully grasp the meaning of this statement and then apply it to your build system.

m4 provides a built-in macro, changequote, that modifies the quoting delimiters. Because unbalanced single quotes are often found in shell script, and because unbalanced square brackets are rare in shell script, autoconf changes the m4 quote characters from ` and ‘ to [ and ].

Let’s take a short text file (which I’ve named sample1.in) through m4 and see what comes out. Play with the spacing on the text a bit and notice the changes to the output, bearing in mind my earlier comments on white space as tokens. In this example, I also introduce the dnl macro:

changequote([,])dnl
Some text that will pass through with no changes.
[[Here's a quoted line of text.]]
# Here's a comment

Like many Unix utilities, m4 is written as a filter, which means that, by default, it accepts input on STDIN and sends output to STDOUT. A single file name on the command line redirects STDIN from the file, and output may be redirected to a file if desired. Here is the m4 command and the output:

$ m4 sample1.in
Some text that will pass through with no changes.
[Here's a quoted line of text.]
# Here's a comment
$

The first thing to notice here is the results of processing the changequote and dnl macros. The changequote macro expands to nothing, but has some interesting side effects. You already (mostly) understand changequote, but dnl is new. This is another m4 built-in macro whose sole purpose is to discard the rest of a line of text, including the terminating CR, passing none of it through as output. Effectively dnl expands to less than nothing. The name of the macro is actually an acronym for “do not load”. Without dnl the output text would contain an extra CR before the word, “Some”. Why? Remember the rules: White space characters are tokens, not garbage. If it’s not a macro name or quoted text, then it’s passed through unchanged, along with all the other tokens in the input stream.

Another item of interest is the reduction in quote nesting on the quoted string. I passed in “[[text]]”, but I got back “[text]“. m4 removes one level of quoting with each pass through a stream of input text. This is often another sore spot with autoconf users. A solid understanding of m4 quote removal will fix some of the nastiest problems you’re likely to encounter when writing configure.ac files. When you define a macro with embedded quotes, the quotes are removed in the definition text. When you then call that macro, the call is replaced with the unquoted version of the text, and the expanded text is again processed for macro calls. If you wanted the expanded text to contain the quotes, then you’ll have to double quote it in the macro definition.

Macros themselves are words, optionally followed by a set of parenthesis encapsulating a comma-separated list of arguments. Each time m4 parses a word token, it scans its macro definition list for a match. If it finds one, then it looks for an open parenthesis as the next token. All macro arguments are optional, and may be omitted. Missing argument are treated by m4 as if the empty string had been passed. If leading or middle arguments are not needed, they may be omitted, but the separating comma characters must be in place. If a trailing argument is not needed it may simply be omitted, along with its preceeding comma separator.

In fact, the changequote macro is written in such a way that it may be used with zero, one or two arguments. Zero arguments means “reset to the default quote delimiters” – ` and ‘. In this case, the entire argument list may be omitted, including the parentheses. One argument means, “Use the parameter value as the opening quote delimiter, and CR as the closing quote delimiter”. Here’s another interesting example to lead us into the next aspect of macro expansion (sample2.in):

changequote([ ,])dnl
[Here's an (apparently) quoted line of text.]
[ Here's a real quoted line of text.]

Processing with m4 generates the following output:

$ m4 sample2.in
[Here's an (apparently) quoted line of text.]
Here's a real quoted line of text.
$

Can you see what happened here? White space matters. I added one space after the first [ in the changequote argument list. The opening quote is now “[ “, not “[“, and m4 honors this request. There are actually some strange rules about using white space in argument lists, and unfortunately they’re important, so here they are:

  • The opening parenthesis in a macro call must immediately follow the macro name, or the macro is called with no arguments and the argument list is treated as part of the input text stream.
  • If too few arguments are passed in a macro call, the missing arguments are each considered to be the empty string.
  • If too many arguments are passed in a macro call, m4 will discard the extra arguments up to the closing parenthesis.
  • Unquoted leading white space (SPACE, TAB, VTAB, CR, LF, and FF) are discarded, but unquoted trailing white space is considered part of the argument.

Regarding the last point in the list above, quoted white space is always considered part of an argument, so it pays to always quote your arguments. Even when you do, however, spacing can still bite you. I’ll prove it with another example (sample3.in):

changequote(`[' , `]')dnl
`Unquoted text'
[Unquoted text]
[ Quoted text]

Processing with m4 reveals some apparently strange behavior:

$ m4 sample3.in
`Unquoted text'
[Unquoted text]
Quoted text
$

“Wha…?”, you ask? Remember the rules: Trailing white space is always considered to be part of the argument, even if that trailing white space follows the closing quote of a quoted portion of the argument. An argument to a macro call is defined as the entire string of input tokens between the opening parenthesis or the comma following the previous argument, and the next comma or closing parenthesis, minus any leading unquoted white space. Okay, just one more example (sample4.in):

define(macro,$1)dnl
macro(

      `  '... text `quoted text'

      a
, ...)

In this example, macro is defined to expand to its first argument. The result of calling m4 on sample4.in is then:

$ m4 sample4.in
  ... text quoted text

      a

$

Leading unquoted white space; spaces, tabs, line feeds, etc., are discarded, but everything else up to the comma is part of the argument.

What Comes with Autoconf?

Well, that’s enough of m4 for now, but I’ll return to it later when I discuss defining your own autoconf macros. In the meantime, let’s look at what comes with the autoconf distribution. I’m going to look at the file list that installs with an autoconf rpm on an rpm-based Linux distribution. Sometimes I find that it’s easier to comprehend a package if I know what’s in it, so long ago I learned to use the rpm -ql command to list the files installed by a package:

$ rpm -ql autoconf
/usr/bin/autoconf
/usr/bin/autoheader
/usr/bin/autom4te
/usr/bin/autoreconf
/usr/bin/autoscan
/usr/bin/autoupdate
/usr/bin/ifnames
/usr/share/autoconf/Autom4te
/usr/share/autoconf/Autom4te/C4che.pm
/usr/share/autoconf/Autom4te/ChannelDefs.pm
/usr/share/autoconf/Autom4te/Channels.pm
/usr/share/autoconf/Autom4te/Configure_ac.pm
/usr/share/autoconf/Autom4te/FileUtils.pm
/usr/share/autoconf/Autom4te/General.pm
/usr/share/autoconf/Autom4te/Request.pm
/usr/share/autoconf/Autom4te/Struct.pm
/usr/share/autoconf/Autom4te/XFile.pm
/usr/share/autoconf/INSTALL
/usr/share/autoconf/autoconf/autoconf.m4
/usr/share/autoconf/autoconf/autoconf.m4f
/usr/share/autoconf/autoconf/autoheader.m4
/usr/share/autoconf/autoconf/autoscan.m4
/usr/share/autoconf/autoconf/autotest.m4
/usr/share/autoconf/autoconf/autoupdate.m4
/usr/share/autoconf/autoconf/c.m4
/usr/share/autoconf/autoconf/fortran.m4
/usr/share/autoconf/autoconf/functions.m4
/usr/share/autoconf/autoconf/general.m4
/usr/share/autoconf/autoconf/headers.m4
/usr/share/autoconf/autoconf/lang.m4
/usr/share/autoconf/autoconf/libs.m4
/usr/share/autoconf/autoconf/oldnames.m4
/usr/share/autoconf/autoconf/programs.m4
/usr/share/autoconf/autoconf/specific.m4
/usr/share/autoconf/autoconf/status.m4
/usr/share/autoconf/autoconf/types.m4
/usr/share/autoconf/autom4te.cfg
/usr/share/autoconf/autoscan/autoscan.list
/usr/share/autoconf/autotest/autotest.m4
/usr/share/autoconf/autotest/autotest.m4f
/usr/share/autoconf/autotest/general.m4
/usr/share/autoconf/m4sugar/m4sh.m4
/usr/share/autoconf/m4sugar/m4sh.m4f
/usr/share/autoconf/m4sugar/m4sugar.m4
/usr/share/autoconf/m4sugar/m4sugar.m4f
/usr/share/autoconf/m4sugar/version.m4

The only files I removed from this list (for the sake of brevity) were documentation files and directory names. That said, what you see here is the entire package. Now, let’s consider the set of files. The installed binaries include:

  • autoconf
  • autoheader
  • autom4te (pronounced “automate” – a bit of leet-speak here)
  • autoreconf
  • autoscan
  • autoupdate
  • ifnames.

I’ve already briefly discussed autoscan. The autoupdate utility is simliar to autoscan, except that it updates an existing autoconf project to the current version of autoconf. If you wrote your project to an eariler version of autoconf, then autoupdate will update your configure.ac syntax to that of the currently installed version.

The autoheader and autom4te utilities, along with the .pm files used by autom4te, really deserve their own articles, so I’ll skip these for now. The ifnames program is used internally by autoconf (although you can read the man page and use it if you want to).

The autoreconf utility updates all of your generated files if they are older than the files used to generate them. Basically, you can quickly ensure that your build system is up to date by running autoreconf, instead of autoconf. If all files are already up to date, then nothing happens.

What’s left are all .m4 files. These files are all macro files and base configuration files used by autoconf to generate your configure script from your configure.ac file.

Finding Macro Definitions

Now that you know something about m4, and where autoconf files are installed, it’s a simple matter of recursively grep’ing the /usr/share/autoconf directory for the name of the macro to find out how it’s defined. Okay, it’s not that simple, but it’s helped me lots of times to figure out just what a macro actually does, despite what the manual tells me it does. Additionally, sometimes the manual just isn’t detailed enough to help me understand the full scope of a macro. And google searches just don’t do the trick when you’re trying to figure out how to use an autoconf macro.

Summary

Okay, that’s enough for today. If you understood the examples and followed the discussion without too much head scratching, then you passed. Pat yourself on the back, but don’t stop there. Experiment with m4 to get a feel for it. Just run ‘m4′ from the command prompt and start entering text. In an interactive m4 session, each line of text is processed immediately, but the results of previous lines are remembered for the session (use Ctrl-D to quit). For example, if you call changequote, the quotes will be changed for subsequent input lines. Playing with m4 will help you to get a handle on the rules. The rules are simple, but rather strict, so you must internalize them in order to become proficient at writing autoconf input files.

Next time I’ll focus on some of the more important predefined autoconf macros.

About these ads

One comment on “Digging Deeper into Autoconf (AT/2)

  1. Eugueny says:

    Thank you for writing this. Please keep going, you’re my newest bookmark in Firefox. I started Linux development less than a year ago and for now I’ve managed to avoid autotools alltogether, only because I haven’t shipped anything yet.

    Thank you.

Comments are closed.