.\" this document requires the tmac.wrprc macros
.\"
.\" $(TROFF) $(MSMACROS) tmac.wrprc thisfile
.\"
.\" revision date - change whenever this file is edited
.ds RD 18 May 1997
.\"
.EH 'troffcvt \*- A troff Converter'- % -''
.OH ''- % -'troffcvt \*- A troff Converter'
.OF 'Revision date:\0\0\*(RD''Printed:\0\0\n(dy \*(MO 19\n(yr'
.EF 'Revision date:\0\0\*(RD''Printed:\0\0\n(dy \*(MO 19\n(yr'
.\"
.de St	\" troffcvt special text
\\&\\$3\fB@\\$1\fR\\$2
..
.de Cl	\" troffcvt control
\\&\\$3\fB\e\\$1\fR\\$2
..
.de Rq	\" troff request
\\&\\$3\fB\.\\$1\fR\\$2
..
.de Es	\" troff escape
\\&\\$3\fB\e\\$1\fR\\$2
..
.de Ac	\" action
.LP
.ta \n(LLuR
\(bu \\$1	(\fB\\$2\fR)
.br
..
.TL
.ps +2
troffcvt \*- A troff Converter
.ps
.AU
Paul DuBois
.H*ahref mailto:dubois@primate.wisc.edu
dubois@primate.wisc.edu
.H*aend
.AI
.H*ahref http://www.primate.wisc.edu/
Wisconsin Regional Primate Research Center
.H*aend
Revision date:\0\0\*(RD
.AB
Yet another
.I troff -related
program, with yet another set of misassumptions about how
.I troff
interprets its input, and its own set of deficiencies and bugs.
.AE
.\"
.H*toc*title "Table of Contents"
.\"
.Ah Introduction
.\"
.LP
This document describes
.I troffcvt
(``\fItroff\fR convert''), a program which assists in the process of converting
.I troff
documents to other formats.
.I troffcvt
doesn't do the full job of translation itself; rather, it is a preprocessor
that turns
.I troff
files into an intermediate format with a syntax that is easier to
interpret than the raw
.I troff
input language.
.I troffcvt
is
intended as a front end that supplies input to a postprocessor which
finishes the translation to produce output in the target format.
Since the job of writing a translator for a given target
format then need not include writing a
.I troff -parser,
the burden of the translator writer is reduced.
In a sense,
.I troffcvt
is simply another sort of
.I ditroff ,
one that produces a different output language than does
.I ditroff .
.LP
.I troffcvt
started out as a
.I sed
script for converting
.I troff
to RTF (Rich Text Format),
but it quickly became evident that doing the job correctly
wasn't going to be simple.
It seemed a better justification of effort
to write a more general tool that would be useful in contexts other than
that of RTF production.
The source distribution contains some example
translation methods (simple postprocessors) you can look at.
A standard
.I troffcvt
output reader is included in the
distribution; it can be configured for use with your own postprocessors.
.LP
.I troffcvt
has a number of significant shortcomings.
It doesn't do very well with input that has been passed through
.I tbl ,
.I eqn
or
.I pic .
(For input containing tables, you can use
.I tblcvt
rather than
.I tbl
to get better results.)
.I troff
constructs that involve determination of motion or sizes sometimes are
calculated inaccurately since
.I troffcvt
knows nothing of font metrics.
Conditional construct processing is problematic also, as are position-dependent
traps.
.I troffcvt
has other limitations; if those just listed are
insufficient to dissuade you from using it as the basis for a
translator, see the document
.H*ahref bugs.html
.I "troffcvt \*- Notes, Bugs, Deficiencies\c"
.H*aend
\&.
.\"
.Ah "Translation Model"
.\"
.LP
.I troff
files consist of text to be formatted interspersed with markup
requests indicating how the formatting is to be done.
The language implemented by
.I troff
is essentially an inverted programming language, where document text
comprises the comments and markup requests provide the program
indicating how to format the comments.
This language isn't especially easy to parse, which may be why there
are few tools for translating
.I troff
documents into other formats.
Most tools that do exist seem to use pattern-matching transformations,
rather than making any attempt to actually understand the
.I troff
language.
The purpose of
.I troffcvt
is to make it easier to write
.I troff -to-XXX
translators, for arbitrary XXX, by doing
the hard work of turning
.I troff
input into something easier to interpret.
This means part of the job is already done for
postprocessor writers, who can then concentrate on producing
output in the desired target format rather than on figuring out
how to understand
.I troff
files.
.LP
For example, point size might be set by some disgusting sequence like this:
.Ps
.ne 5
\&.ds x *a
\&.ds y *b
\&.nr \e(\e*x\e(\e*x 12
\&.ds \e(\e*y\e(\e*y \e\en(\e(\e*x\e(\e*x
\&.ps \e*(\e(\e*y\e(\e*y
.Pe
This is digested by
.I troffcvt
and appears in the output in somewhat simpler form:
.Ps
\epoint-size 12
.Pe
.I Caveat:
The
quality of translators obviously depends on the quality of
.I troffcvt 's
preprocessing, which is suspect.
Nor is the situation improved by the fact that various versions of
.I troff
sometimes do different things with identical input.
This makes it difficult for
.I troffcvt
to do the ``correct'' thing in all cases, especially for
input files that have been tailored to work with (i.e., around bugs in)
a particular version of
.I troff .
.LP
.I troffcvt
produces output that preserves
information about the structure of a document (e.g., margins, page
length) and its contents (the text it contains).
The goal is
.I not
to lay out text on pages.
That is left to postprocessors, which are expected to lay out document
content by interpreting structure information.
Postprocessors may use the structure and/or content to varying extents.
For instance, a postprocessor that simply extracts the text would ignore
the structure information.
A postprocessor that produces a summary of the structure
(e.g., page layout information) would ignore the text.
Most postprocessors will fall somewhere between these extremes.
.LP
Inevitably, a certain amount of information is lost.
Usually this results from not knowing all the characteristics of
the output device.
For instance, no font metric information is used, so it's not possible
to determine the position on the current page, or even to know
what the current page number is.
.LP
An example of
.I troffcvt
operation is shown below.
(The default resolution of 432 units/inch is assumed.)
.Ps
.ta 3i
\f(CBInput	Output\fP

\&.ps 14	\epoint-size 14
\&.vs 16	\espacing 96
\&.ce	\ecenter	
\&.ft B	\efont B	
troffcvt\e\-a troff converter	troffcvt	
\&.ft	@minus	
	a troff converter	
	\ebreak	
	\eadjust-full	
	\efont R	
.Pe
.\"
.Ah "troffcvt Output"
.\"
.LP
.I troffcvt
produces a mixture of control and text lines.
Control lines correspond to document structure.
They consist of a backslash character
.Cw \e
followed by a control word and possibly some parameters for the control word,
e.g.,
.Cl space ,
.Cl "font R" ,
.Cl "page-length 4752" .
Text lines correspond to document content, and are either plain text
written literally to the output,
or begin with a ``@'' to indicate special characters
(for instance,
.St bullet
for the ``\(bu'' character or
.St alpha
for ``\(*a'').
.LP
None of the control or special-text keywords overlap, but it's still
convenient to use different
leading characters
.B \e
and
.B @
to make it easier for simple filter
programs to distinguish between them.
For example, the following command strips control lines
from a file containing
.I troffcvt
output:
.Ps
% \f(CBsed \-e "/^\|\e\e\|/d"\fP \f(CIfilename\fP
.Pe
.I troffcvt
output is rife with
.I troff -isms,
such as
.Cl need
and
.Cl embolden .
Little effort was made to map these to more general document layout
concepts since it's not clear what gain, if any, there would be in doing so.
.\"
.Ah "How troffcvt Works"
.\"
.LP
There are two steps to turning
.I troff
files into some other format:
.Ls B
.Li
Run
.I troffcvt
to turn
.I troff
files into input for your postprocessor.
.Li
Run the postprocessor to generate final output.
.Le
.I troffcvt
is configured by means of action files, which are described below.
.I troffcvt
postprocessor writing is a separate issue from understanding how
.I troffcvt
itself works, and is covered in a separate document,
.H*ahref output.html
.I "troffcvt Output Format and Postprocessor Writing\c"
.H*aend
\&.
.LP
Probably the easiest way to get some idea of the relationship of
.I troffcvt 's
input and output is to run some
.I troff
files through it and look at what comes out.
When
.I troffcvt
runs, it
reads one or more action files to configure itself, then processes input
files according to the information in the action files.
These are text files
containing symbolic actions that specify what happens when
requests occur.
Action files are also used
to define special characters and to set processing parameters.
.LP
.I troffcvt
doesn't have built-in knowledge about any
.I troff
request.
Stated another way, unless
.I troffcvt
is told how to implement a given
.I troff
request by means of some action file, it ignores that request.
It also knows about very few of the characters that have special meaning
(by design, since these vary from one version of
.I troff
to another).
All of this stuff has to be specified in an action file.
By default,
.I troffcvt
reads the action file
.I actions
when it runs.
You can also name additional action files on the command line using
the
.B \-a
option.
.LP
The format of action files is simple.
Blank lines are ignored.
Lines beginning with a ``#'' character are also ignored, so you can use them
to include comments.
Actions are specified on a line consisting of a leading keyword to
indicate the action
type (\fBimm\fR or \fBreq\fR), followed by an action list of
zero or more actions.
(An action line may be continued to the next line by putting a backslash
at the end of the line.)
Action lists can be executed immediately at the time the action file is read,
or they can be associated with a request, to be executed whenever the
request occurs in the input.
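.LP
A short sketch of these rules follows.
It uses only the
.B imm
keyword and actions that appear elsewhere in this document; the layout is
illustrative and is not copied from the distributed
.I actions
file.
.Ps
# comment lines and blank lines are ignored

imm font R
# one action line continued onto a second line
imm point-size 10 \e
	spacing 12p
.Pe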
.LP
Immediate actions consist of the word
.B imm
followed by an action list that is executed as
soon as it has been read.
The first
.B imm
line below sets the point size to 10 points and vertical spacing to 12
points, whereas the second sets the font to roman:
.Ps
imm point-size 10 spacing 12p
imm font R
.Pe
Request actions are similar but specify a request name, a set of actions
for parsing the request's arguments, and a set of actions for processing
those arguments after they have been parsed:
.Ps
req \f(CIrequest-name parsing-actions\fP eol \f(CIpost-parsing-actions\fP ...
.Pe
.I request-name
is the name of the request (without the leading period).
The
.I parsing-actions
section specifies how to parse the arguments expected by the request.
If
.I parsing-actions
is empty, no request arguments are expected (or any that are present are to be ignored).
The
.B eol
keyword is mandatory.
It signifies the end of the parsing actions and causes
.I troffcvt
to skip to the end of the request line.
(If this were not done, the remaining part of the request line
would be read as a separate line to be processed.)
The
.I post-parsing-actions
section specifies what should happen after the request arguments have been
parsed.
Typically this involves interpreting the request arguments.
If the
.I post-parsing-actions
section is empty, nothing is done with the request (the request is ignored).
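.LP
As a minimal sketch of the degenerate case, the following line has an empty
.I parsing-actions
section and an empty
.I post-parsing-actions
section, so a request defined this way takes no arguments and does nothing.
The choice of
.Rq fl
is an assumption made only for illustration.
.Ps
req fl eol
.Pe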
.LP
.I troffcvt
associates each action name with the number of arguments that should
follow the action when it occurs in action lists.
When an action is performed, any arguments specified
in the action list are passed to it.
For instance, the
.Rq so
request can be described like this:
.Ps
req so parse-filename eol push-file $1
.Pe
The
.B parse-filename
action parses the line on which the request occurs to find a filename.
This filename becomes the value of argument 1, which can be referred to
later as
.B $1 .
.B push-file
pushes the file named by
.B $1
on the input stack.
Since
.B $1
refers to the first argument parsed from the
.Rq so
request,
if the request is ``.so junk'', then ``push-file $1'' becomes
``push-file junk'',
and
.I junk
becomes the current input file.
.LP
The two
.B req
lines below show how the
.Rq ps
and
.Rq ce
requests can be defined:
.Ps
req ps parse-absrel-num x point-size eol point-size $1
req ce parse-num x eol center $1
.Pe
The actions to take when a
.Rq ps
request occurs are:
parse a number, which can be an absolute setting or relative to the current
point size; skip to the end of the
request line; set the point size using the previously parsed number.
The actions for
.Rq ce
are to
parse a number, skip to the end of the
request line, and cause the next ``$1'' input lines to be centered.
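.LP
For example, under the
.Rq ps
definition above, input such as the following (the values are arbitrary):
.Ps
\&.ps 14
\&.ps \-2
.Pe
would be expected to produce output along these lines, the second value
being computed relative to the first:
.Ps
\epoint-size 14
\epoint-size 12
.Pe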
.LP
``Missing'' arguments are passed as empty strings.
A reference to
.B $n
is passed to the action as the empty string if no
.B n -th
argument was present on the input request line.
Suppose the
.Rq ds
request is defined like this:
.Ps
req ds parse-name parse-string-value y eol define-string $1 $2
.Pe
Then if the following input line occurs, the
.B parse-string-value
action will find no string on the line, and the
.B define-string
action will define
.B xx
as the empty string:
.Ps
\&.ds xx
.Pe
The language implemented by
.I troff
is expressive (if somewhat unwieldy), so a large number of actions seem to
be necessary to allow requests to be specified properly.
Descriptions of all actions are given in the
.H*ahref "actions.html"
.I "troffcvt Action Reference"
.H*aend
document.
.LP
If you don't like the
.I actions
file supplied with the
.I troffcvt
distribution, you can modify it as necessary for your own purposes.
.LP
Specifying
.I troffcvt 's
behavior in terms of symbolic actions rather than hardwiring them into the code
allows a good deal of flexibility, because
.I troffcvt 's
initial state and response to requests can be modified without changing
.I troffcvt
itself.
For example, different versions of
.I troff
often know about different sets of special characters; building the list
at runtime allows different versions to be accommodated.
The initial page layout can also be specified this way,
since although
initial values for processing parameters are the same as those given
in the Ossanna manual, you can change them.
Thus you can set up layouts for letter size, legal, A4, etc.
.LP
This method of configuring
.I troffcvt
also means you can experiment quite easily with
.I troffcvt 's
response to particular requests.
.\"
.Ah "Names and Objects"
.\"
.LP
Request, macro, string, and register definitions consist of two parts:
a name, and the underlying object to which the name points.
.I troffcvt
allows
.I groff -style
aliases to be created, such that referring to an alias name is the same
as referring to the original name.
Aliases are implemented by creating multiple names that all point to the
same underlying object.
The object structure contains a reference count indicating
how many names point to it.
.LP
For example, when a macro is defined, a name is allocated and pointed
at a macro object structure that holds the macro contents (the body
of the macro).
The reference count in the object structure is set to one.
If an alias to the macro is created, a new name is created and made to
point to the same object structure as the original name.
The reference count in the object structure is set to two.
When a request, macro, string, or register is removed, the name is
deallocated and the reference count is decremented.
If the count goes to zero, the underlying object is no longer needed
(no other names point to it), and the object structure is deallocated as well.
.LP
The reference count also includes the number of times an object is currently
in use.
When a request or macro is invoked or a string reference occurs,
the reference count of the underlying object is incremented.
When the request or macro terminates, or the end of the string is reached,
the count is decremented.
This use of the reference count has two purposes:
.Ls B
.Li
It prevents an object from being deallocated while it's in
use, even if all the names that point to it disappear while it's being used.
.Li
It allows infinite recursion to be detected easily.
If the count gets ``too large,'' it's assumed that a recursion loop has
occurred.
.Le
Consider the following example of a macro that removes itself:
.Ps
\&.de xx
\&.rm xx
\&..
.Pe
When the macro is defined initially, the name
.B xx
is created and made to point at a macro object, which is given a reference
count of one.
Invoking the macro results in the following actions:
.Ls B
.Li
When
.B xx
begins executing, the reference count is incremented to two.
.Li
When
.B xx
removes itself, the name
.B xx
is removed from the name list and the reference count in the macro object
is decremented.
Since the count goes from two to one (not zero), the macro object is not
deallocated, even though no name points to it any more.
(This is a good thing, because otherwise we'd be deallocating the current
input source!)
.Li
When
.B xx
terminates, the reference count of the macro object is decremented again.
This time the count goes to zero, and the macro object is deallocated.
.Le
Now consider the following slightly more complicated macro, which
removes itself after creating an alias to itself:
.Ps
\&.de xx
\&.als yy xx
\&.rm xx
\&..
.Pe
The reference count is set to one when the macro is defined,
two when the macro begins executing, three when the alias is created,
two when
.B xx
removes itself, and one when
.B xx
terminates.
In this case, however, since the reference count is still one when
.B xx
terminates (the name
.B yy
still points to the underlying macro object),
the object is not deallocated.
.LP
Aliases provide a convenient way to implement the
.Rq rn
request.
The new name is created as an alias for the existing name, and thus
points to the same underlying object.
The old name is then removed, but since the underlying object is now pointed
to by another name, it persists as it should.
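.LP
As an illustration of this equivalence (a sketch based on the description
above, not on the
.I troffcvt
source; the names
.B xx
and
.B yy
are arbitrary), the request:
.Ps
\&.rn xx yy
.Pe
behaves like the sequence:
.Ps
\&.als yy xx
\&.rm xx
.Pe
The reference count of the underlying object goes from one to two when the
alias is created and back to one when
.B xx
is removed, so the object survives under its new name.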
.\"
.Ah "Macro Package Handling"
.\"
.LP
.I troff
is commonly invoked with some sort of
.B \-m \fIxx\fR
flag (e.g.,
.B \-man ,
.B \-me ,
.B \-mm ,
.B \-ms ),
so these need to be handled by
.I troffcvt
as well.
There are several ways of handling a macro package, some better than others:
.Ls B
.Li
Ignore it.
Since none of the macros referenced in the input file will then have any
definitions as far as
.I troffcvt
is concerned, they'll all be ignored.
This isn't really very useful.
.Li
Read the contents of the macro file first before reading the input file named
by the user.
This capability is built into
.I troffcvt :
if an argument
.B \-m \fIxx\fR
is specified,
.I troffcvt
looks for an appropriate macro package file (e.g.,
.I /usr/lib/tmac/tmac.xx )
and reads it.
The advantage of this approach is that you don't have to know anything
about how the macro package works.
There are two disadvantages:
.Ls B "" \*-
.Li
First, it's likely that
.I troffcvt
will botch some of the macros.
Macro packages tend to be very ugly, i.e., clever, complex, powerful,
forbidding, and obviously written by people who understand
.I troff
better than I.
Thus, you quickly discover just how unsophisticated
.I troffcvt
really is.
.Li
Second,
.I troffcvt
may write lots of output in response to a macro invocation in the input.
A paragraphing macro may set point size, vertical spacing, font, indent,
centering, underlining, etc.
What you'll see in the
.I troffcvt
output are the effects of all the requests to which the macro expands.
That's okay for preserving the appearance of a document.
It won't be helpful if instead you want your postprocessor to recognize
when the original macro was used in the
.I troff
file.
.Le 0
.Li
Redefine macros in an action file so that they
generate an output string that a postprocessor can notice as signifying
something special.
For instance, the
.Rq LP
and
.Rq IP
macros
from
.B \-ms
can be described like this:
.Ps
req LP eol output-control "\eother para"
req IP eol output-control "\eother indented-para"
.Pe
This will cause
.Cl other
.B para
or
.Cl other
.B indented-para
to be written to the output when instances of
.Rq LP
or
.Rq IP
occur in the input.
A postprocessor can recognize
.Cl other
.B para
and
.Cl other
.B indented-para
and do something sensible with them.
.Li
Redefine macros in an action file so that
they mimic the effect of the real macros.
This allows you to provide a definition of the macro that's simpler than
the one found in the macro package, and that will be easier for
.I troffcvt
to handle.
For example, the
.Rq AB
macro from
.B \-ms
can be defined, roughly, as:
.Ps
req AB parse-macro-args eol \e
	push-string ".br\en.ce\en\e\efIABSTRACT\e\efR\en.sp\en"
.Pe
You need to understand something about how the macros are supposed to work
for this approach to be fruitful.
.Le
You can also use a combination of the methods above.
Probably the easiest way to start is to run a few documents through
.I troffcvt
and supply only an
.B \-m \fIxx\fP
argument:
.Ps
% \f(CBtroffcvt -m\f(CIxx\fP myfile\f(CW
.Pe
This will tell you which macros
.I troffcvt
handles okay and which it botches.
With that information in hand, you can construct an action file
.I tc.mxx
containing redefinitions for those macros that
.I troffcvt
needs help with.
Try out the action file like this:
.Ps
% \f(CBtroffcvt -m\f(CIxx\fP -a tc.m\f(CIxx\fP myfile\f(CW
.Pe
By experimenting with
.I tc.mxx ,
you can improve
.I troffcvt 's
handling of any document that uses the
.B \-m \fIxx\fP
macro package.
.LP
Some of the examples shown above demonstrate how to redefine
macros, but do so by defining them using
.B req
lines.
Thus, these ``macros'' are actually treated by
.I troffcvt
as requests.
Before you redefine a macro as a request, be sure you understand the following
points:
.Ls B
.Li
.I troff
treats requests and macros similarly, but not exactly the same.
For example, you must specify the name of a macro and not a request as the
argument to the input-trap macro and end macro requests
.Rq it "" (
and
.Rq em ).
.I troffcvt
enforces this constraint, too.
.Li
If you define a name as a request, you can't apply
.Rq am
to it later.
(Actually, you can, but the effect is to delete the request definition
and begin a new macro definition \*- probably not what you want.)
.Le
If a name that you're defining in an action file
.I must
refer to a macro and not to a request (e.g., if you want to use it with
.Rq it
or
.Rq em ,
or if you want to be able to append to it later using
.Rq am ),
then don't define it using a
.B req
line.
If you do, it'll be considered a request by
.I troffcvt .
Instead, use an
.B imm
line containing a
.B push-string
action to execute a string that contains the contents of a
.Rq de
request.
For example:
.Ps
imm push-string ".de xx\en.tm this is macro xx\en..\en"
.Pe
If you provide redefinitions that might get used in concert with macro
packages written for
.I groff ,
here's something to watch out for:
before redefining a name for which a definition may have
already been read from the macro package file, it's prudent to remove
the name first, like this:
.Ps
imm remove-name XX
req XX \f(CIdefinition...\fP
.Pe
This is due to the way that
.I groff
implements macro packages.
Consider the
.B \-ms
macros.
These are supposed to be used such that
.Rq TL ,
.Rq AU ,
.Rq AI ,
and
.Rq AB
occur in order if they are used.
To make sure they aren't invoked out of order, the
.I groff
.B \-ms
definitions initially create
.Rq AU ,
.Rq AI ,
and
.Rq AB
as aliases to a macro that checks whether or not
.Rq TL
has been invoked.
When
.Rq TL
is invoked it redefines the other macros appropriately with their ``real''
definitions.
Now, suppose that you handle
.B \-ms
by reading the macro package file and then redefining in an action
file some of the macros such as
.Rq AI
and
.Rq AB .
If you simply provide a new definition of
.Rq AI ,
what happens is that you also redefine all other names that are aliased
along with
.Rq AI .
In other words, you also redefine
.Rq AU
and
.Rq AB !
If you then redefine
.Rq AB ,
you also redefine
.Rq AU
and
.Rq AI .
Removing a name before giving it a new definition avoids this problem.
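.LP
For instance, a sketch of an action file that replaces both
.Rq AI
and
.Rq AB
after the
.I groff
.B \-ms
macros have been read might look like this (the
.B institution
and
.B abstract
marker words are hypothetical, chosen only for illustration):
.Ps
imm remove-name AI
req AI eol output-control "\eother institution"
imm remove-name AB
req AB eol output-control "\eother abstract"
.Pe
Because each name is removed before it is redefined, the new definition
is attached to a fresh object rather than to the shared one, and
.Rq AU
keeps the definition it got from the macro package.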
.\"
.Bh "Conditions That Prevent Macro Redefinition"
.\"
.LP
Suppose you normally format a document
.I mydoc
using a command something like this:
.Ps
% \f(CBtroff -ms mydoc\fP
.Pe
If you use
.Rq so
.B mymacros
in
.I mydoc
to read a file of macro definitions, you may have a problem if you
want to process
.I mydoc
with
.I troffcvt .
In particular,
if you want to redefine any of the macros in
.I mymacros
for
.I troffcvt 's
benefit, you won't be able to use an action file to do so:
.Ls B
.Li
Any action file that affects processing of
.I mydoc
has to be read prior to
.I mydoc
(and thus prior to
.I mymacros ,
since that file is referenced from
.I mydoc ).
.Li
The later definition always takes precedence, so the definitions in
.I mymacros
override any attempted redefinition in the action file.
.Le
If you really need to redefine the macros in
.I mymacros
when you format
.I mydoc
with
.I troffcvt ,
you can use the following strategy:
.Ls B
.Li
Remove the
.Rq so
.B mymacros
request from
.I mydoc
and format the document this way:
.Ps
% \f(CBtroff -ms mymacros mydoc\fP
.Pe 0
.Li
Create an action file
.I tc.mymacros
and put into it any redefinitions for macros in
.I mymacros
that need to be redefined.
Then to generate
.I troffcvt
output, use this command:
.Ps
% \f(CBtroffcvt -ms mymacros -a tc.mymacros mydoc\fP
.Pe
This will cause
.I troffcvt
to read, in order,
the
.B \-ms
macros, the standard definitions of the macros in
.I mymacros ,
the redefinitions in the action file
.I tc.mymacros ,
and
.I mydoc .
.Le
If you use
.I groff ,
an alternative strategy can be used.
Leave the
.Rq so
.I mymacros
request in
.I mydoc ,
but surround each definition in
.I mymacros
with an
.Rq if
.B d
test:
.Ps
\&.if d xx .ig end_ignore
\fI\&...macro definition here...\fP
\&.end_ignore
.Pe
Then you can format the document with
.I troffcvt
like this:
.Ps
% \f(CBtroffcvt -ms -a tc.mymacros mydoc\fP
.Pe
When
.I tc.mymacros
is processed, it defines some or all of the macros used in
.I mymacros .
When
.I mydoc
is read and the
.Rq so
.I mymacros
request is processed, only those macros that were not already
defined in
.I tc.mymacros
will be defined.
.LP
Similar considerations apply if you define macros directly in your
.I troff
source file.
You won't be able to override them in an action file because the definition
in the
.I troff
source file occurs later and will take precedence.
To work around this, put the macro definitions in a separate file and
use the first strategy described above, or use
.Rq if
.B d
as in the second strategy.
.\"
.Ah "Input/Output Mechanisms"
.\"
.H*aname character-coding
.H*aend
.Bh "Character Coding"
.\"
.LP
.I ChIn()
returns values of type
.Cw XChar ,
which is
.Cw typedef 'ed
as an unsigned integer type.
The return
value falls into the following ranges:
.IP 0
.sp .5v
This value signifies end of file on the current input source.
.IP "1..127 (0x01..0x7f)"
.sp .5v
Plain ASCII character.
.IP "128..255 (0x80..0xff)"
.sp .5v
8-bit (non-ASCII) input character.
.IP "257..511 (0x101..0x1ff)"
.sp .5v
Escape code for ASCII or 8-bit character preceded by an escape
character (except
.Es ( ,
see below).
The code for \fB\e\fIX\fR is constructed as 0x100\||\|\fIX\fR.
Note that 0x100 is not a valid escape code because null bytes are
stripped from the input.
.IP ">=512 (>=0x200)"
.sp .5v
Special-character code.
Sequences of the form \fB\e(\fIxx\fR or \fB\e[\fIxxx\fP]\fR
are recognized and converted to special-character codes.
These codes start at 512 so that they are greater than all ASCII, 8-bit, or
escape codes.
If a special-character reference is encountered for a name that has no
definition (i.e., the character was not defined in any action file),
a new special character with an empty value is created on the fly.
This is done on the following grounds:
.RS
.Ls B
.Li
It's undesirable to simply panic.
.Li
It's undesirable to delete the sequence from the input, since that would
break ugly (but perfectly legal) constructions like this:
.Ps
\&.if \e(xx\f(CIstring\fP\e(xx\f(CIstring\fP\e(xx\ \f(CIstuff\fP ...
.Pe
A warning is written to
.I stderr
when a character is created this way, so that you know a
special-character definition is missing from the action file
(or that the input file contains an erroneous special-character reference).
.Li
If the character is put back onto the input with
.I UnChIn() ,
the code can be converted back into the original name, even though the
character has no definition.
.sp .5v
A minor problem occurs when the
.Rq if
.B c
construction is used to test whether or not a character is defined, because
.B any
special character reference results in a valid
.Cw SpChar
structure being associated with it, even if the character didn't exist
before being referenced.
Therefore the test is made based on whether or not the character's value
(definition) is empty.
.Le
.RE
.LP
.I UnChIn()
takes an
.Cw XChar
argument, which is usually a value returned from
.I ChIn() .
.I UnChIn()
pushes the argument onto the input pushback stack, unpacking escape and
special-character codes into their original multiple-character input sequences.
Unpacking is done so that pushed-back characters are reinterpreted correctly
if the input mode changes before they are reread.
Suppose an escaped or special character is first seen in non-copy mode,
then pushed back and reread in copy mode.
If the escape code or special-character code itself were pushed back,
the character wouldn't be reread in copy mode properly.
.LP
Values for plain ASCII and 8-bit characters can be represented in a
single byte
(as an unsigned character), but escape codes and
special-character codes cannot, since they
begin at 257 and 512, respectively.
This is why the
.Cw XChar
type is wider than a single byte.
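.LP
As a worked illustration of these ranges (the sample characters are chosen
arbitrarily), the codes for a plain character and for two escaped characters
come out as follows:
.Ps
.ta 1i 3.5i
\f(CBInput	Code	Kind\fP

a	0x61 (97)	plain ASCII
\e\-	0x100 | 0x2d = 0x12d (301)	escape code
\ee	0x100 | 0x65 = 0x165 (357)	escape code
.Pe
A special character such as \fB\e(bu\fR receives a code of 512 or greater;
the exact value is not specified here.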
.LP
Special characters are disallowed in request arguments and escape sequences
that might be written back out directly.
For instance,
.Rq ft
.I F
is written out as
.Cl font
.I F ,
so
.I F
isn't allowed to contain special characters.
A similar restriction applies to diversion names.
.LP
Special-character names must consist entirely of printable ASCII characters.
They are not allowed to be composed of other special characters, e.g.,
.B \e(\e(ts\e(ts
is disallowed.
.\"
.Bh "Input Processing"
.\"
.LP
Input may come from a file, a macro, a named string (created with the
.B define-string
action, usually in response to a
.B \.ds
request), or an anonymous string (defined below under the description
of the
.I AChIn()
function).
The bulk of input usually comes from
input files named on the command line, which are processed in sequence.
Input sources may be nested (e.g., a macro or string
may be referenced while reading a file).
The current input is suspended when another input source is interpolated
into the input stream, and is resumed when the interpolated source is
exhausted.
.LP
.I ChIn()
returns the next input character from the input stream.
Embedded newlines (introduced with a backslash character
.B \e
at the end of a line) are deleted
so that the following input line appears contiguous with the current line
to any higher-level routines.
Comments (introduced with \fB\e"\fR)
are deleted up to (but not including) the end of line character.
For instance, this makes:
.Ps
text followed by comment\e" this is the comment
.Pe
appear to be:
.Ps
text followed by comment
.Pe
The handling of lines that begin with \fB\.\e"\fR happens properly; the
comment stripping makes the line look like a line beginning with a control
character but no request, so it is ignored.
.I ChIn()
also manages encoding
of escaped characters, and pushing to input sources for number register,
string or macro argument references.
Handling of escape sequences differs depending on whether
copy mode is in effect or not.
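.LP
Embedded-newline deletion can be pictured in the same way as comment stripping.
In this sketch (the text itself is arbitrary), the two input lines:
.Ps
first part of line, \e
second part of line
.Pe
appear to higher-level routines as the single line:
.Ps
first part of line, second part of line
.Pe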
.LP
Input characters accepted by the file-input routine are non-null ASCII
values (null bytes and bytes with bit 8 on are discarded).
Escaped characters (\fB\e\fIx\fR)
and special-character references
(\fB\e(\fIxx\fR or \fB\e[\fIxxx\fB]\fR)
are converted to escape codes and special-character codes as described
above under
.H*ahref #character-coding
``Character Coding.''
.H*aend
.LP
Input source pushing occurs automatically in
.I ChIn()
when
.Es n ,
.Es *
or
.Es $
occur
(and also
.Es w
if not in copy mode):
the input source switches to a string representing the value of the
number register or string, the macro argument, or the result of the width
calculation.
Higher-level routines can also cause the current input source to be pushed
down, e.g., when a
.Rq so
request occurs.
.LP
.I ChIn()
is also used for the ugly task of processing multi-line conditional
input (bracketed with the
.Es {
and
.Es }
sequences).
The conditional request processor saves the current if-level
when it sees a
.Es { ,
bumps it up one,
then processes lines until the level drops back down to the saved
value.
.I ChIn()
notices
.Es } ,
silently deletes it and decrements the if-level, which is then noticed
by the conditional processor.
Pretty horrid.
.LP
.I UnChIn()
is used to push characters back onto the input stream.
It understands how to push back escape codes and special-character
codes properly.
It also understands how to push back multiple characters (characters must
be pushed in the reverse order from that in which they were read).
.LP
.I ChIn0()
returns the next raw (uninterpreted) character from the input stream.
If there are any pushed back characters waiting to be reread, it returns
the one most recently pushed.
Otherwise, it
returns the next input character from the current input source.
If that source is exhausted, it resumes reading from the previous source.
When there is no more input, it returns
.Cw endOfInput .
Input source unwinding is undetectable at any level above
.I ChIn0() ,
including
.I ChIn() .
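.LP
The order in which the raw-input routine consults its sources (pushback
first, then the current source, then any suspended sources) can be modeled
with a small Python sketch.
The class, method, and constant names are invented for illustration, and
plain strings stand in for files, macros, and strings; the real routines are
written in C and differ in detail.

```python
END_OF_INPUT = object()  # stands in for the endOfInput value


class InputStack:
    # Hypothetical model of nested input sources plus a pushback stack.

    def __init__(self):
        self.sources = []   # iterators over characters, innermost last
        self.pushback = []  # pushed-back characters, most recent last

    def push_source(self, text):
        # Suspend the current source and start reading from a new one.
        self.sources.append(iter(text))

    def unread(self, ch):
        # Push one character back; callers push in reverse reading order.
        self.pushback.append(ch)

    def next_char(self):
        # Pushed-back characters are returned first, most recent first.
        if self.pushback:
            return self.pushback.pop()
        # Otherwise read the current source, silently unwinding to the
        # previous one whenever the current one is exhausted.
        while self.sources:
            ch = next(self.sources[-1], None)
            if ch is not None:
                return ch
            self.sources.pop()
        return END_OF_INPUT
```

A caller of next_char() sees one continuous stream of characters; nothing
in the return values reveals where one source ended and the suspended one
resumed.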
.LP
.I FChIn() ,
.I MChIn() ,
and
.I AChIn()
are the lowest level input routines; they're called by
.I ChIn0() .
These return a single character from a file, a macro or named string,
or an ``anonymous''
string input source, respectively.
Each returns
.Cw endOfInput
when the source is exhausted (which only means the current
source is done, not necessarily that all sources are done).
.Cw EOF
is not returned because that is typically \-1 (negative), and the input
routines return a value of type
.Cw XChar ,
which is unsigned.
.LP
.I FChIn()
discards nulls.
(It also converts CR or CRLF to LF; this has nothing to do with
.I troff ,
but allows text files from MS-DOS or Macintosh
machines to be read without requiring you to convert line endings first.)
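.LP
A Python sketch of this filtering follows.
It is only an approximation of the behavior described in this section:
the function name is hypothetical, the whole input is processed at once
instead of character by character, and the discarding of bytes with the
high bit set is taken from the earlier statement about which characters
the file-input routine accepts.

```python
def filter_file_bytes(data):
    # Hypothetical helper: data is a bytes object read from an input file.
    out = []
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        i += 1
        if b == 0 or b >= 0x80:
            # Nulls and bytes with the high bit set are discarded.
            continue
        if b == 0x0D:
            # CR or CRLF becomes a single LF.
            if i < n and data[i] == 0x0A:
                i += 1
            out.append("\n")
            continue
        out.append(chr(b))
    return "".join(out)
```

With this filter in place, the higher-level routines only ever see LF as
the end-of-line character, whatever convention the input file used.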
.LP
.I MChIn()
reads the next character from a macro or string definition.
(Strings are implemented internally as macros without arguments.)
.LP
.I AChIn()
reads the next character from an anonymous string,
which is just some arbitrary string that is to be used as an
input source.
For instance, when a number register reference
.Es n ) (
or width expression
.Es w ) (
occurs, the resulting value is converted
to a character string, which becomes the current input until
the string is completely read.
References to macro arguments are treated similarly; the argument value
is retrieved and pushed on the input stack.
Another source of anonymous strings is the
.B push-string
action, which can be used in action files to push an arbitrary string onto
the input stream.
This is convenient for processing certain requests.
For instance, if you want to redefine a macro, you can define
the action for that macro to be one that pushes alternative input.
Here's an example that shows how the
.B \&.AB
macro from the
.B \-ms
macro package might be redefined:
.Ps
req AB parse-macro-args eol \e
	push-string ".br\en.ce\en\e\efIABSTRACT\e\efR\en.sp\en"
.Pe
One sticky problem occurs with the
.B \.nx
request, usually processed with the
.B switch-file
action.
When
.B \.nx
occurs, it might happen while other files or macros are active.
If the current input source is a file there is no problem since the file pointer
for that source is simply switched to the new file.
But if the request occurs in the middle of a macro, it's less clear what should
happen.
Should the macro continue to be processed?
I elect to terminate macro sources and unwind the source stack until a file is
found, then switch the file pointer of the file source.
Possibly this is wrong; the
.I troff
manual is ambiguous on this point.
(Which may be why different versions of
.I troff
behave differently in this situation.)
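.LP
The unwinding policy I chose can be sketched in Python as follows.
The representation of the source stack as a list of (kind, name) pairs and
the function name are assumptions made for this illustration only.

```python
def switch_file(stack, new_name):
    # stack: list of (kind, name) pairs, innermost source last, where
    # kind is "file", "macro", or "string" (a made-up representation).
    # Terminate macro and string sources until a file is on top.
    while stack and stack[-1][0] != "file":
        stack.pop()
    if not stack:
        raise ValueError("no file source is active")
    # Re-point the file source at the new file.
    stack[-1] = ("file", new_name)
    return stack
```

Note that only the innermost file source is switched; any files suspended
below it on the stack are left alone.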
.LP
For handling the
.B \.ex
request, the
.B end-input
action is used; it sets a flag causing
.I ChIn()
to return
.Cw endOfInput
forever after.
.\"
.Bh "Output Processing"
.\"
.LP
At the lowest output level, there are two calls.
One is for writing characters and it simply writes to the output
file and dies if there was an error.
The other is for writing strings; it calls the write-character routine
for each character in the string.
.LP
The next level up manages the mechanics of collecting plain text lines and
interspersing them with special text and control lines.
The basic issues are insertion of spaces between successive output text
lines and making sure that special text and control lines don't get
written into the middle of a plain text line.
.LP
Control lines begin with a backslash character
.Cw \e .
Any plain text output line being collected is flushed so the control
string doesn't appear on the same line.
.LP
There are two kinds of text output:
plain text lines, and special text lines
that indicate special characters (e.g.,
.St backslash
for the
.Cw \e
character).
Whenever text output (either kind) is written, a check is made
to see whether it's necessary to write a preceding space first.
A space is usually needed between consecutive input text lines (exceptions are
when centering or no-fill are in effect, or if an input line ends with a
.Es c ).
For special text,
any plain text output line being collected is flushed so the special
text doesn't appear on the same line.
.LP
The output character set for text is such that most printable ASCII
characters appear as themselves, and others are written out as special
text lines.
The characters tab, backspace,
.B \e ,
and
.B @
are printable but written as specials
.St tab ,
.St backspace ,
.St backslash ,
and
.St at .
The leader character SOH is written as
.St leader .
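.LP
The split between plain text and special text lines can be illustrated with
the following Python sketch.
It covers only the five characters named above, the table and function
names are invented, and it returns a list of output lines instead of
writing to a file.
Spacing between input lines and control-line output are left out.

```python
# Characters that are written as special text instead of as themselves.
SPECIALS = {
    "\t": "tab",
    "\b": "backspace",
    "\\": "backslash",
    "@": "at",
    "\x01": "leader",  # SOH
}


def encode_text(text):
    # Hypothetical helper: returns the output lines for a run of text.
    lines = []
    pending = []  # plain text line being collected
    for ch in text:
        name = SPECIALS.get(ch)
        if name is None:
            pending.append(ch)
            continue
        # Flush collected plain text so the special text line does not
        # end up in the middle of it.
        if pending:
            lines.append("".join(pending))
            pending = []
        lines.append("@" + name)
    if pending:
        lines.append("".join(pending))
    return lines
```

Because backslash and at-sign never appear literally in plain text output,
a postprocessor can tell the three kinds of line apart by looking at the
first character alone.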
.\"
.Ah "Input Levels"
.\"
.LP
.I troffcvt
maintains a notion of input level.
The level is incremented each time a new input source begins and decremented
when the current source ends.
A file interpolated with
.Rq so
is an input source, but so is a macro, a macro argument, a string, or a
number register.
This helps avoid the problem of interpreting something like this:
.Ps
\&.if '\e*[xx]'y' ...
.Pe
when the string
.B xx
contains an apostrophe.
.I troffcvt
uses the input level in such a way that
.I troff
constructs bounded by delimiters do not consider the closing delimiter
to be found unless it occurs at the same input level as the opening delimiter.
(If you simply look at characters as they occur, then the apostrophe
in the string prematurely terminates the scan for the first of the strings
to be compared, and throws off the comparison.)
The affected constructs include:
.Ps
\&.if '\f(CIx\fP'\f(CIy\fP'
\&.tl '\f(CIleft\fP'\f(CIcenter\fP'\f(CIright\fP'
\eb'\f(CIabc...\fP'
\eh'\f(CIN\fP'
\el'\f(CINc\fP'
\eL'\f(CINc\fP'
\eo'\f(CIabc...\fP'
\ev'\f(CIN\fP'
\ew'\f(CIstring\fP'
.Pe
The input level also affects parsing of macro arguments that begin with a
double quote.
Only a quote at the same input level as the opening
quote terminates the argument.
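.LP
The effect of the input level on delimiter matching can be shown with a
small Python sketch.
It is a model, not the actual implementation: each character is tagged
with the level at which it was read, only bracketed string references are
interpolated, and the function names are hypothetical.

```python
def expand(text, strings, level=0):
    # Return (char, level) pairs; a bracketed string reference is
    # replaced by the string's contents, tagged one level higher.
    out = []
    i = 0
    while i < len(text):
        if text.startswith("\\*[", i):
            end = text.index("]", i)
            name = text[i + 3:end]
            out.extend(expand(strings[name], strings, level + 1))
            i = end + 1
        else:
            out.append((text[i], level))
            i += 1
    return out


def read_delimited(chars, start, level_aware=True):
    # chars[start] is the opening delimiter.  Return the enclosed text
    # and the index of the closing delimiter.
    delim, open_level = chars[start]
    buf = []
    i = start + 1
    while i < len(chars):
        ch, lvl = chars[i]
        if ch == delim and (not level_aware or lvl == open_level):
            return "".join(buf), i
        buf.append(ch)
        i += 1
    raise ValueError("unterminated delimited construct")
```

If the string xx is defined as it's, scanning the expanded conditional
argument with level_aware set to True yields it's as the first string to
compare.
With level_aware set to False, which corresponds to looking at characters
simply as they occur, the scan stops at the apostrophe inside the string
and yields only it.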
.LP
The behavior just described mimics how
.I groff
treats its input, not how standard
.I troff
treats its input.
However,
.I groff
ignores the input level (and thus acts like standard
.I troff )
in compatibility mode.
.I troffcvt
does the same.
(Parsing routines that need to check the input level call the
.I ILevel()
function.
In compatibility mode this function always returns zero, making all
input appear to be at the same level.)
.LP
.I groff
produces a quoted argument list when
.Es $@
occurs in the input.
The
.I groff
documentation says that it
processes the list such that the quotes surrounding an argument appear
at the same input level, whereas the argument itself is processed
at a higher level.
(This prevents the problems that would occur if an argument contained
a quote.)
I take this to mean that the quotes surrounding the arguments are at a level
higher than the context in which the
.Es $@
occurs, and the arguments one level higher than that, in case something
like the following occurs in a macro:
.Ps
\&.xx "\e\e$@"
.Pe
If the quotes produced by
.Es $@
here were treated as being at the same level as quotes in the surrounding text,
the arguments to
.Rq xx
could be messed up.
.LP
.I troffcvt
handles
.Es $@
by constructing a string consisting of a list of argument references that
looks like this:
.Ps
"\e\e$1" "\e\e$2" ... "\e\e$\f(CIn\fP"
.Pe
Then the string is pushed on the input stack.
This causes the quotes to be processed a level higher than the surrounding
text.
When each argument reference in the string is encountered, the value of
the argument is pushed on the stack, causing the reference to be processed
another level higher.
.LP
.I troffcvt
handles
.Es $*
in a manner similar to
.Es $@
except that no quotes are added to the string containing the list
of argument references.
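.LP
Building the two reference lists is straightforward; a Python sketch is
shown below.
The function name is made up, and the references are spelled with two
backslashes only because that is how they appear in the example above.
The sketch refuses more than nine arguments, since single-digit references
cannot name them.

```python
def arg_reference_list(nargs, quoted):
    # Hypothetical helper: build the string that is pushed on the input
    # stack for an all-arguments reference, quoted or unquoted.
    if nargs > 9:
        raise ValueError("sketch handles at most nine arguments")
    refs = []
    for i in range(1, nargs + 1):
        ref = "\\\\$" + str(i)  # two backslashes, as displayed above
        refs.append('"' + ref + '"' if quoted else ref)
    return " ".join(refs)
```

Since the list is pushed as an input source and each reference in it pushes
another source when it is read, the quotes and the argument values end up
on two different input levels without any extra bookkeeping.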
.\"
.Ah "Macro Argument Quoting"
.\"
.LP
Macro arguments consist of strings of non-whitespace characters.
Arguments may be quoted to allow whitespace to be included.
An argument that begins
with a double quote is parsed in quote mode until a closing quote,
and the leading and terminating quotes are stripped off.
.LP
Double quotes in macro arguments are handled as follows:
.Ls B
.Li
If not in quote mode, just treat the quote as part of the argument.
.Li
If in quote mode and the quote is doubled, one quote character is stripped
off and the other becomes part of the argument.
If the next character is not a quote, the quote is the
closing quote of a quoted argument.
.Le
The quote stripping described above presents an interesting problem in standard
.I troff .
If an argument contains quotes, and then is used inside the macro by
being passed to another macro, quote stripping occurs again.
This is really ugly, because it means you must understand the implementation
of the macros you're using and know how many extra quotes to put in your
arguments so that they end up with the correct number when they finally
reach the bottom-level macro.
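.LP
The quoting rules listed above, and the repeated stripping they lead to,
can be tried out with the following Python sketch.
It is a simplification: the function name is hypothetical, input levels are
ignored (as in compatibility mode), and an argument is assumed to end
immediately at its closing quote.

```python
def parse_args(line):
    # Hypothetical helper: split a macro argument string into arguments.
    args = []
    i = 0
    n = len(line)
    while i < n:
        while i < n and line[i] in " \t":
            i += 1
        if i >= n:
            break
        buf = []
        if line[i] == '"':
            # Quote mode: the leading quote is stripped.
            i += 1
            while i < n:
                if line[i] == '"':
                    if i + 1 < n and line[i + 1] == '"':
                        # Doubled quote: keep one, strip the other.
                        buf.append('"')
                        i += 2
                        continue
                    # Closing quote: strip it and end the argument.
                    i += 1
                    break
                buf.append(line[i])
                i += 1
        else:
            # Not in quote mode: a quote is just part of the argument.
            while i < n and line[i] not in " \t":
                buf.append(line[i])
                i += 1
        args.append("".join(buf))
    return args
```

Parsing an argument written with doubled quotes around the word hi gives a
single argument in which hi is surrounded by one pair of quotes.
If that value is wrapped in quotes again and parsed a second time, as
happens when a macro passes its argument on to another macro, it falls
apart into two arguments.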
.LP
Neither
.I groff
nor
.I troffcvt
has this problem, since quotes in arguments occur at a higher level than
the surrounding text.
(In compatibility mode,
.I troffcvt
uses the quote-stripping behavior of standard
.I troff .)
