HTML2TEXT v1.20b

1997 © Gavin Spearhead


  1. What is it?

    HTML2TEXT is a utility that converts HTML files to plain text. Optionally it also tries to figure out if the HTML file is well-constructed.

    All Rights Reserved

    Permission to use, copy, and distribute this software and its documentation for any purpose and without fee is hereby granted, provided that the above copyright notice appear in all copies and that both that copyright notice and this permission notice appear in supporting documentation, and that the name Gavin Spearhead not be used in advertising or publicity pertaining to distribution of the software without specific, written prior permission.

    *** DISCLAIMER ***

    GAVIN SPEARHEAD DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL GAVIN SPEARHEAD BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

    ***

    Any bugs, errors, or suggestions should be sent to the author. Reports of unsupported HTML tags or amp-codes are also welcome, ideally with a description of the tag or code, its restrictions and its options.

    You are encouraged to register this software. Registered users will receive either the latest versions when they are released or a note that a new version has been released. Registration also gives me an idea of how many people use this program and how widely it has spread. There are three ways to register:

    1. Start your web-browser and fill in the form
    2. Convert register.htm to a text file, edit it to fill in the entries and email it to me
    3. Same as above but send it to my postal address
    Note that registration is free of charge!

    Once you are registered you will receive a registration key, so that your name is displayed when you run the program. The key file will be sent via email if possible. This is currently the only way to receive the registration key.

  2. Which files are contained in the package?

    HTML2TXT.EXE  The executable
    HTML2TXT.CFG  Configuration file with options
    HTML2TXT.INI  Ini-file with amp strings
    HTML2TXT.HTM  Documentation for HTML2TEXT in HTML format
    HTML2TXT.TXT  Documentation for HTML2TEXT in text format
    REGISTER.HTM  Registration form in HTML format

    If one of the files is missing, throw the package away and ask the author for a new complete copy. The address is at the end of the file.

  3. How to start it?

    Type at the command line:

      HTML2TXT <filespecification> <options>

    <Filespecification> is the name of the file or files to convert; it may include wildcards and may appear more than once on the command line. Note that long filenames (Windows 95) are not supported. This means that input filenames have to be in the 8.3 format (every W95 file has an 8.3 filename and optionally a long filename). The output will be an 8.3 filename.

    <options> can be any of the following:

    -x Errors are marked in the output file by [XXX <error> ]
    -w Warn about HTML errors in the source file
    -s Write output to standard output
    -e Display the title in the output file
    -a Display the alternative text for an image
    -l Write the link to the output file
    -i Write user input fields and buttons to the output file
    -t Reformat tables
    -q Treat balancing of quotes strictly
    -h, -? Both display a help screen
    -o Controls the overwriting policy for existing files. The suffixed characters have the following meaning:
    • A: Always overwrite
    • V: Never overwrite
    • D: Always append
    -r Controls the line-wrapping policy. If suffixed by a number, lines will be at most that many characters long; if suffixed by a '-', no line wrapping takes place. If no suffix is given, lines are wrapped according to the screen width (usually 80 characters).
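
    The three cases of the -r suffix can be sketched in a few lines of Python. This is only an illustration: the function name wrap_text and the constant SCREEN_WIDTH are made up for this example, and Python's textwrap module stands in for whatever wrapping algorithm HTML2TXT actually uses.

```python
import textwrap

SCREEN_WIDTH = 80  # assumption: the usual screen width mentioned above


def wrap_text(text, suffix=None):
    """Wrap text according to a hypothetical -r suffix.

    suffix None  -> wrap at the screen width
    suffix "-"   -> no wrapping at all
    suffix "40"  -> lines of at most 40 characters
    """
    if suffix == "-":
        return text
    width = SCREEN_WIDTH if suffix is None else int(suffix)
    # textwrap.fill breaks on whitespace and never exceeds `width`
    # unless a single word is longer than the width.
    return textwrap.fill(text, width=width)


if __name__ == "__main__":
    print(wrap_text("aaa bbb ccc", "7"))
```

    Passing the suffix as a string mirrors how it would arrive from the command line; a real implementation would also have to reject non-numeric suffixes.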

    The result file has the same name as the original file, but with the extension specified in the config-file (default is .txt). If the original extension is already the same as the output extension, the extension will always be '.tx1'.
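
    As a rough sketch of this naming rule, the following Python function derives the output name from the input name. The function name output_name is hypothetical, and the case-insensitive comparison is an assumption based on DOS filenames not being case-sensitive.

```python
import ntpath  # DOS/Windows path rules, available on every platform


def output_name(input_name, out_ext=".txt"):
    """Return the output filename for input_name (illustrative only)."""
    base, ext = ntpath.splitext(input_name)
    # Assumption: extensions are compared without regard to case.
    if ext.lower() == out_ext.lower():
        # Input already carries the output extension: fall back to .tx1
        return base + ".tx1"
    return base + out_ext


if __name__ == "__main__":
    print(output_name("INDEX.HTM"))
    print(output_name("NOTES.TXT"))
```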

    All messages are written to stderr.

  4. What do the files HTML2TXT.INI & HTML2TXT.CFG do?

  5. What does it do?

    HTML2TEXT converts HyperText Mark-up Language (HTML) files to plain-text (ASCII) files. The following rules are applied:

  6. Which tags does it recognise?

    Tag What it does in HTML2TEXT
    A Checks unless <a name=...>, optionally a [ Name ] or [ Link ] is written
    ADDRESS See I
    APPLET Checks, ignores text between <APPLET></APPLET>
    AREA Ignore
    B Checks, Optionally writes BOLD-token
    BASE Ignores
    BASEFONT Ignores
    BGSOUND Ignores
    BIG Checks
    BLINK Checks (Does this really make the text blink???)
    BLOCKQUOTE Checks, indents
    BODY Checks
    BR Writes a newline
    CAPTION Ignore
    CENTER Checks, centers when linewrap is on
    CITE see I
    CODE See Pre
    COMMENT Ignores anything between <COMMENT></COMMENT>, Checks
    DD Inserts newline and indents
    DFN See I
    DIR See OL
    DIV Checks, writes a newline at both open and close tag (Is this correct??)
    DL Starts a definition list, Checks
    DT Inserts a newline
    EM see I
    EMBED Ignore
    FRAME Ignore
    FRAMESET Checks
    FONT Checks
    FORM Checks
    H Checks
    HD1, HD2, HD3, HD4, HD5, HD6 Writes the text surrounded by newlines
    HEAD Checks
    HR Writes a line of '='s if size > 3, otherwise a line of '-'s. The length is absolute or relative, according to the width value.
    HTML Everything after </HTML> is ignored, Checks
    I Checks, Optionally writes ITALIC-token
    IMG Ignored, Optionally writes 'alt' text
    INPUT Ignored
    ISINDEX Write a prompt plus optionally [ Input ]
    KBD See B
    LI Writes a list element identifier: for UL a * (or the character specified in the config-file), for OL a number. The type and value parameters are used.
    LINK Ignored
    LISTING See Pre
    MAP Checks
    MARQUEE Checks
    MENU See OL
    META Ignored
    NEXTID Ignores
    NOBR Checks
    NOFRAMES Checks
    OL An ordered list, Checks, type parameter used
    OPTION Ignore
    P Starts a new paragraph
    PRE Output as is, Checks (line wrapping, if on, is still applied)
    S see strike
    SAMP Checks
    SCRIPT Ignores anything between <SCRIPT></SCRIPT>, Checks
    SELECT Checks
    SMALL Checks
    SOUND Ignores
    STRIKE Checks
    STRONG See B
    SUB Checks
    SUP Checks
    TABLE Checks, starts/finishes a table
    TD Defines a table cell
    TEXTAREA Ignored
    TITLE Writes the title, if within <HEAD></HEAD>, Checks
    TH Defines a table header cell
    TR Defines a table row
    TT Checks
    U Checks
    UL An unordered list, Checks, type parameter used
    VAR See Pre
    WBR Ignores
    !DOCTYPE Ignores

    Here Checks means that for every open tag a matching closing tag is sought. In most cases the order of the closing tags is not relevant, but occasionally the output will be unexpected.
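
    To make the idea of "Checks" concrete, here is a minimal Python sketch that counts opening and closing tags without caring about their order. It is not HTML2TXT's own algorithm: the class BalanceChecker, the function check_balance and the small CHECKED_TAGS set are invented for illustration, and Python's html.parser does the tokenising.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical subset of the tags marked "Checks" in the table above.
CHECKED_TAGS = {"b", "i", "body", "html", "head", "center", "font"}


class BalanceChecker(HTMLParser):
    """Counts open tags and reports closing tags that have no partner."""

    def __init__(self):
        super().__init__()
        self.open_counts = Counter()
        self.problems = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag names in lower case.
        if tag in CHECKED_TAGS:
            self.open_counts[tag] += 1

    def handle_endtag(self, tag):
        if tag not in CHECKED_TAGS:
            return
        if self.open_counts[tag] == 0:
            self.problems.append(f"</{tag}> without opening tag")
        else:
            self.open_counts[tag] -= 1


def check_balance(html_text):
    """Return a list of balance problems found in html_text."""
    checker = BalanceChecker()
    checker.feed(html_text)
    checker.close()
    unclosed = [
        f"<{tag}> never closed"
        for tag, count in sorted(checker.open_counts.items())
        if count > 0
    ]
    return checker.problems + unclosed


if __name__ == "__main__":
    print(check_balance("<b>bold <i>both</b></i>"))
    print(check_balance("<b>oops"))
```

    Because only counts are kept, crossed tags such as <b><i></b></i> pass this check. A stricter, order-sensitive variant would use a stack instead of a counter.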

    Here ignores means that the tag is just ignored, no output is generated.

    Some tags have optional closing tags; these are ignored and not checked, e.g. <tr>, <td>, <th>, <p>. Some tags need a closing tag but not always (e.g. <a name=...>); in that case only the ones that do need a closing tag will be checked. Note that this only describes the actions taken by HTML2TEXT, not what the HTML specification says.

    Some of those need a closing tag (preceded by a slash); these will be checked if the tag was opened before. It is also checked whether those tags are closed in the right order. Furthermore, it is checked that tags are not nested unnecessarily (e.g. bold); usually this indicates a missing slash in the second tag. Many tags are simply ignored and thus generate no output. Some tags optionally generate output. Any text after </html> is ignored. Some tags cause the following text to be ignored.

    Unknown tags are ignored and optionally a message is generated.

    Tables generate the following output. Every table row is written on at least one line, and every row ends with a linefeed. Table columns are separated by at least one space (no boxes or anything). Table options are implemented but currently do not work very well: a row can be affected by at most one rowspan and one colspan. Also, text won't be stretched to the full length of cells with rowspans; the surrounding cells will be empty instead. If line wrap is chosen, tables are squeezed to a minimum size; otherwise each cell is as wide as the longest cell in its column.

    For large tables, check the config-file for parameters that let those be handled well too (who uses tables larger than 256 × 10 with cells of more than 64 KB? However, some people build their whole pages in a table...). If you do have such tables, you will have to increase max_rows and max_cols in the config-file. When this is necessary, the most likely error messages are error 13 and error 14; error 7 and error 12 are also possible. Except for error 12, these are fatal errors. Occasionally this may also lead to a situation in which your machine seems not to respond.

    Nested tables aren't supported either; they are treated as if the notables option were set to on. Only the outer table will be formatted.
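
    The basic layout rule (one line per row, columns separated by a space, each column as wide as its longest cell) can be sketched as follows. This is an illustration under simplifying assumptions, not the program's code: the function name format_table is made up, cells are plain strings, and rowspan, colspan and line wrapping are ignored.

```python
def format_table(rows):
    """Lay out rows (lists of cell strings) as space-separated columns."""
    if not rows:
        return ""
    n_cols = max(len(row) for row in rows)
    # Each column is as wide as the longest cell found in it.
    widths = [0] * n_cols
    for row in rows:
        for index, cell in enumerate(row):
            widths[index] = max(widths[index], len(cell))
    lines = []
    for row in rows:
        padded = [cell.ljust(widths[index]) for index, cell in enumerate(row)]
        # One space between columns; trailing padding is dropped.
        lines.append(" ".join(padded).rstrip())
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_table([["Tag", "Action"], ["BR", "newline"]]))
```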

  7. What does it output?

    HTML2TEXT can have two kinds of output:

    1. It can just throw all output to standard output. This means that all files specified are concatenated to stdout. This also means that one can pipe or redirect the output directly.

    2. It can create files with extension specified in the config-file (or 'tx1' in case the input has that extension). If the output already exists, the user is asked to confirm overwriting of the file.

    Note that all messages are written to standard error. This is because one needs to make a distinction between the converted text and the additional information output by HTML2TEXT. Thus any messages are written to the screen even if stdout is redirected. Standard error can be redirected as well (although command.com does not support this). The hush option also suppresses output to stderr.
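
    The separation between converted text and messages might look like this in Python. The function emit and its parameters are hypothetical, and the message text in the example is only a placeholder, not the program's actual wording.

```python
import sys


def emit(converted_text, messages, out=sys.stdout, err=sys.stderr, hush=False):
    """Write converted text to `out` and diagnostics to `err`.

    With hush=True the diagnostics are dropped, so only the converted
    text is produced.
    """
    out.write(converted_text)
    if not hush:
        for message in messages:
            err.write(message + "\n")


if __name__ == "__main__":
    # Redirecting stdout to a file would capture only the first argument;
    # the message would still appear on the screen.
    emit("Converted text\n", ["placeholder warning message"])
```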

  8. Errors and Warnings

    Errors

    Error  Error string                     Description
    1      Illegal parameter                A command line parameter was not recognised
    2      No such file                     No file was found matching the file name specification
    3      No filename specified            No file specification was found on the command line
    4      Config-file not found            The program could not locate the config-file, which is usually found in the current directory or in the directory containing html2txt.exe
    5      Ini-file not found               The program could not locate the ini-file (see error 4)
    6      Error in ini-file                One entry in the ini-file contains an illegal value
    7      Not enough memory                There wasn't enough memory to execute the program
    8      File couldn't be opened          One file could not be found or opened
    9      Error in config-file             One entry in the config-file contains an illegal value
    10     Too many amp-codes in ini-file   The ini-file contains too many codes to hold in memory
    11     File skipped                     The file couldn't be converted
    12     Heap corrupted                   Memory is being corrupted during the conversion
    13     Too many rows in table           The table contains more rows than the program can keep in memory
    14     Too many columns in table        The table contains more columns than the program can keep in memory
    15     Specified path is illegal        The ini-file couldn't be found in the specified path
    16     Could not create temporary file  There isn't enough space on the disk or there aren't enough handles to open a temporary file
    17     File writing error: Disk full    An error occurred while writing a file; most likely the disk is full

    Warnings

    Warning  Warning string                      Description
    256      Unrecognised HTML-code              The HTML-code was not recognised, probably not defined
    257      Ill-constructed HTML-code           The HTML-code was different from the one expected, probably the order of codes is swapped
    258      Illegal list item                   The list item or list was of an illegal type
    259      Semicolon expected                  The semicolon after a &... sequence is missing
    260      Illegal token                       A token was encountered which wasn't legal in the context
    261      Ill-constructed amp code            The amp code is not defined in the ini-file
    262      Misplaced <title>                   The title appeared outside of the head section
    263      HTML-tag starts with space          The HTML tag starts with a space character
    264      Invalid list type                   The type specified for a list item or a list was illegal
    265      Unexpected '>' encountered          A greater-than token was encountered without a matching less-than token
    266      LI without list                     An LI tag was encountered outside a list section
    267      DD without definition list          A DD tag was encountered outside a definition list
    268      DT without definition list          A DT tag was encountered outside a definition list
    269      Tables within tables not supported  A table section within a table was encountered
    270      Table cell truncated                A table cell contained more than 65K of data

  9. Problems & Open Issues

  10. How to obtain a new copy of HTML2TEXT?

    There are several ways to obtain a copy of html2text:

    1. Download it here: www.noord.bart.nl/~wieger1/h2t120b.zip

    2. Write an email to me and ask. I will return the latest copy attached to the reply as a self-extracting archive.

    3. Write to my postal address and ask. Be sure to enclose enough (Dutch) money to cover the postage. Also include the return address. For Dutch return addresses include a SASE (with enough postage).

    4. Look for a copy on the nearest BBS or internet site.

  11. What's left to work out?

  12. What changes were made?

    1.20a & 1.20b
    • Bugs fixed and internal revisions
    • Table reformatting improved (no more garbage output)
    • Multiple file specifications on commandline
    • Added an option so that no output is generated (-h)
    1.10
    • Fixed some bugs
    • Improved commandline options parsing
    1.02
    • Added formatting of tables.
    • Better processing of user input (forms)
    • Line wrapping added internally (WORDWRAP.EXE is not necessary anymore)
    • Option added to set line wrap
    • WORDWRAP package not included anymore
    • Amp sequences of the format &#nnn; can now be defined, but can be ignored.
    • A very nasty bug fixed which corrupted the heap
    • HTML warnings optional
    • Added and removed commandline options
    1.01
    • Fixed a nasty bug: when outputting to STDOUT, the # of errors was displayed before the last text.
    • More intelligent algorithm for performing simple text
    • Added value and type options to <LI> tag
    • Added option for title
    • Decreased default value for stack and tag size
    • Documentation converted to HTML format
    1.00
    • First version, not released

  13. How to reach the author?

    Write email to:

    wieger1@noord.bart.nl

    or
    schotanu@cs.utwente.nl

    Write to:
    Gavin Spearhead
    Witbreuksweg 387-302
    7522 ZA Enschede
    The Netherlands

    The latest version of this file can be found at www.noord.bart.nl/~wieger1/html2txt.htm