.TH DOC2TXT 1
.SH NAME
doc2txt, xls2txt, olefs, mswordstrings, msexceltables \- extract printable strings from Microsoft Office documents
.SH SYNOPSIS
.B doc2txt
[
.I file.doc
]
.br
.B xls2txt
[
.I file.xls
]
.br
.B aux/olefs
[
.B -m
.I mtpt
]
.I file.doc
.br
.B aux/mswordstrings 
.I /mnt/doc/WordDocument
.br
.B aux/msexceltables
[
.B -aDnt
] [
.B -d
.I delim
]
.I /mnt/doc/Workbook
.SH DESCRIPTION
.I Doc2txt
is a shell script that uses 
.I olefs
and
.I mswordstrings
to extract the printable text from the body of a Microsoft Word document.
.I Xls2txt
performs a similar function for Microsoft Excel documents.
.PP
Microsoft Office documents are stored in OLE (Object Linking and Embedding)
format, which is a scaled-down version of Microsoft's FAT file system.
.I Olefs
presents the contents of an Office document as a file system
on
.IR mtpt ,
which defaults to
.BR /mnt/doc .
.I Mswordstrings
or
.I msexceltables
may then be used to parse the files inside, extracting
a text stream.
.I Msexceltables
may be given options to control the formatting of its output.
.TP
.B -n
Disables field padding to column width.
.TP
.B -t
Truncate fields to the column width.
.TP
.B -a
Attempt conversion of non-tabular sheets in the workbook, such as charts.
.TP
.BI -d " delim
Sets the interfield delimiter to the string
.IR delim ,
by default a single space.
.TP
.B -D
Enables debugging output.
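.SH EXAMPLES
The commands below are a minimal sketch of doing by hand what
.I doc2txt
is described as doing; the actual script may differ.
The file name
.B report.doc
is hypothetical, and the sketch assumes that
.I olefs
returns once the document is mounted on the default mount point.
.IP
.EX
# report.doc is a hypothetical file name
aux/olefs report.doc
aux/mswordstrings /mnt/doc/WordDocument
unmount /mnt/doc
.EE
.PP
A similar sketch for a spreadsheet, with
.B budget.xls
and the mount point
.B /n/xls
as hypothetical names, turns off field padding and uses a comma as the delimiter:
.IP
.EX
# budget.xls and /n/xls are hypothetical names
aux/olefs -m /n/xls budget.xls
aux/msexceltables -n -d , /n/xls/Workbook
unmount /n/xls
.EE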
.SH SOURCE
.B /sys/src/cmd/aux/mswordstrings.c
.br
.B /sys/src/cmd/aux/msexceltables.c
.br
.B /sys/src/cmd/aux/olefs.c
.br
.B /rc/bin/xls2txt
.br
.B /rc/bin/doc2txt
.SH SEE ALSO
.IR strings (1)
.br
``Microsoft Word 97 Binary File Format'',
available online at Microsoft's developer home page.
.br
``LAOLA Binary Structures'', 
.I http://snake.cs.tu-berlin.de:8081/~schwartz/pmh 
.br
``OpenOffice.Org's Excel Documentation'',
.I http://sc.openoffice.org/excelfileformat.pdf