HTML2TEXT is a utility that converts HTML files to plain text. Optionally it also tries to figure out if the HTML file is well-constructed.
All Rights Reserved
Permission to use, copy, and distribute this software and its documentation for any purpose and without fee is hereby granted, provided that the above copyright notice appear in all copies and that both that copyright notice and this permission notice appear in supporting documentation, and that the name Gavin Spearhead not be used in advertising or publicity pertaining to distribution of the software without specific, written prior permission.
GAVIN SPEARHEAD DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL GAVIN SPEARHEAD BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
Bugs, errors, and suggestions should be sent to the author. Reports of unsupported HTML-tags or amp-codes are also welcome, along with a description, restrictions and options.
You are encouraged to register this software. Registered users will receive either the latest versions when they are released or a note that a new version has been released. It also gives me an idea of how many people use this program and how widely it has spread. There are three ways to register:
Once you are registered you will receive a registration key, so that your name is displayed when you execute the program. The key file will be sent via email if possible. This is currently the only way to receive the registration key.
HTML2TXT.EXE | The executable |
HTML2TXT.CFG | Configuration file with options |
HTML2TXT.INI | Ini-file with amp strings |
HTML2TXT.HTM | Documentation for HTML2TEXT in HTML format |
HTML2TXT.TXT | Documentation for HTML2TEXT in text format |
REGISTER.HTM | Registration form in HTML format |
If one of the files is missing, throw the package away and ask the author for a new complete copy. The address is at the end of the file.
Type at the commandline:
HTML2TXT <filespecification> <options>
<options> can be any of the following:
-x | Errors are marked in the outputfile by [XXX <error> ] |
-w | Warn for HTML-errors in sourcefile |
-s | Write output to standard output |
-e | Display the title in the output file |
-a | Display the alternative text for an image |
-l | Write the link to output file |
-i | Write user input fields and buttons to output file |
-t | Reformat tables |
-q | Treat balancing of quotes strictly |
-h, -? | Display a help screen |
-o | Controls the overwriting policy of existing files
The suffixed characters have the following meaning:
|
-r | Controls line wrapping: if suffixed by a number, lines will be at most that many characters long; if suffixed by a '-', no line wrapping will take place. If no suffix is given, lines will be wrapped according to the screen width (usually 80 characters) |
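The three forms of the -r option can be illustrated with a short Python sketch. This is not the program's own code: the function name wrap_width is hypothetical, and it assumes that the suffix is attached directly to the option (as in -r72) and that the screen is 80 characters wide.

```python
from typing import Optional

SCREEN_WIDTH = 80  # assumption: the usual screen width mentioned above


def wrap_width(option: str, screen_width: int = SCREEN_WIDTH) -> Optional[int]:
    """Return the maximum line length for a '-r' option, or None for no wrapping."""
    if not option.startswith("-r"):
        raise ValueError("not a -r option: " + option)
    suffix = option[2:]
    if suffix == "":
        return screen_width      # no suffix: wrap at the screen width
    if suffix == "-":
        return None              # '-' suffix: no line wrapping at all
    if suffix.isdigit():
        return int(suffix)       # numeric suffix: maximum line length
    raise ValueError("unrecognised suffix: " + suffix)
```

For example, wrap_width("-r72") gives 72 and wrap_width("-r-") gives None.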
The result file has the same name as the original file, but with the extension specified in the config-file (default is .txt). If the original extension is the same as the output extension, the output extension is always '.tx1'.
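The naming rule can be sketched in Python as follows. The function output_name is hypothetical, and comparing extensions without regard to case is an assumption (it seems natural on DOS, but the documentation does not state it).

```python
import os


def output_name(source: str, out_ext: str = ".txt") -> str:
    """Derive the result file name from the source file name."""
    root, ext = os.path.splitext(source)
    # Assumption: extensions are compared case-insensitively.
    if ext.lower() == out_ext.lower():
        return root + ".tx1"     # avoid overwriting the source file
    return root + out_ext
```

With the default extension, INDEX.HTM becomes INDEX.txt, while notes.txt becomes notes.tx1.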
All messages are written to stderr.
This file contains the translation table for ampersand sequences, i.e. a sequence of characters of the form: &<some_text>;. The lines are of the following format:
<identifier>="<result>"
The <result> may also contain a sequence of the format \<number>; then the character whose ASCII value equals <number> is inserted. Every other character is inserted literally, including quotes and backslashes.
Amp-codes of the format &#nnn; not specified in the config-file will be converted to the character with ASCII value nnn.
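The translation rules above can be sketched in Python. This is only an illustration under assumptions: the function names are hypothetical, identifiers are assumed to consist of letters and digits (optionally preceded by '#'), and unknown named codes are assumed to be left unchanged.

```python
import re


def decode_result(raw: str) -> str:
    """Replace each \\<number> by the character with that ASCII value."""
    return re.sub(r"\\(\d+)", lambda m: chr(int(m.group(1))), raw)


def parse_ini(text: str) -> dict:
    """Build the translation table from lines of the form <identifier>="<result>"."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";"):
            continue                      # blank lines and comments are skipped
        m = re.match(r'^(#?\w+)="(.*)"$', line)
        if m:
            table[m.group(1)] = decode_result(m.group(2))
    return table


def translate(text: str, table: dict) -> str:
    """Translate ampersand sequences using the table, with a numeric fallback."""
    def repl(m):
        name = m.group(1)
        if name in table:
            return table[name]
        if name.startswith("#") and name[1:].isdigit():
            return chr(int(name[1:]))     # &#nnn; not listed in the table
        return m.group(0)                 # assumption: unknown codes stay as they are
    return re.sub(r"&(#?\w+);", repl, text)
```

Given a table containing lt="<" and amp="&", the text "a &lt; b &amp; c &#65;" is translated to "a < b & c A".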
Beware that some options have side effects, e.g. turning off line wrapping also means that text will not be centered.
In both files, any line starting with a semicolon is treated as a comment and thus ignored.
Both files are looked for in the current directory first and then in the directory from which HTML2TEXT was started. Usually these files are placed in the same directory as HTML2TXT.EXE, a directory in your path.
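The search order can be sketched in Python. The function find_support_file and its parameters are hypothetical; the sketch takes the directory of the program file as "the directory from which HTML2TEXT was started".

```python
import os
import sys


def find_support_file(name, program_path=None, cwd=None):
    """Look for a config- or ini-file: current directory first, then program directory."""
    program_path = program_path or sys.argv[0]
    cwd = cwd or os.getcwd()
    program_dir = os.path.dirname(os.path.abspath(program_path))
    for directory in (cwd, program_dir):
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return candidate
    return None  # not found in either place (compare errors 4 and 5)
```

A copy in the current directory therefore takes precedence over the copy that sits next to HTML2TXT.EXE.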
HTML2TEXT converts HyperText Mark-up Language (HTML) files to plain-text (ASCII) files. The following rules apply:
Tag | What it does in HTML2TEXT |
---|---|
A | Checks unless <a name=...>, optionally a [ Name ] or [ Link ] is written |
ADDRESS | See I |
APPLET | Checks, ignores text between <APPLET></APPLET> |
AREA | Ignore |
B | Checks, Optionally writes BOLD-token |
BASE | Ignores |
BASEFONT | Ignores |
BGSOUND | Ignores |
BIG | Checks |
BLINK | Checks (Does this really make the text blink???) |
BLOCKQUOTE | Checks, indents |
BODY | Checks |
BR | Writes a newline |
CAPTION | Ignore |
CENTER | Checks, centers when linewrap is on |
CITE | see I |
CODE | See Pre |
COMMENT | Ignores anything between <COMMENT></COMMENT>, Checks |
DD | Inserts newline and indents |
DFN | See I |
DIR | See OL |
DIV | Checks, writes a newline at both open and close tag (Is this correct??) |
DL | Starts a definition list, Checks |
DT | Inserts a newline |
EM | see I |
EMBED | Ignore |
FRAME | Ignore |
FRAMESET | Checks |
FONT | Checks |
FORM | Checks |
H | Checks |
HD1, HD2, HD3, HD4, HD5, HD6 | Writes the text with a newline before and after |
HEAD | Checks |
HR | Writes a line of '='s if size > 3, otherwise a line of '-'s. The length is set absolutely or relatively, according to the width value. |
HTML | Everything after </HTML> is ignored, Checks |
I | Checks, Optionally writes ITALIC-token |
IMG | Ignored, Optionally writes 'alt' text |
INPUT | Ignored |
ISINDEX | Write a prompt plus optionally [ Input ] |
KBD | See B |
LI | Writes a listelement identifier, for ULs * or specified in config-file, for OL a number, parameter type and value used |
LINK | Ignored |
LISTING | See Pre |
MAP | Checks |
MARQUEE | Checks |
MENU | See OL |
META | Ignored |
NEXTID | Ignores |
NOBR | Checks |
NOFRAMES | Checks |
OL | An ordered list, Checks, type parameter used |
OPTION | Ignore |
P | Starts a new paragraph |
PRE | Output as is, Checks (line wrap is still applied, if on) |
S | see strike |
SAMP | Checks |
SCRIPT | Ignores anything between <SCRIPT></SCRIPT>, Checks |
SELECT | Checks |
SMALL | Checks |
SOUND | Ignores |
STRIKE | Checks |
STRONG | See B |
SUB | Checks |
SUP | Checks |
TABLE | Checks, starts/finishes a table |
TD | Defines a table cell |
TEXTAREA | Ignored |
TITLE | Writes the title, if within <HEAD></HEAD>, Checks |
TH | Defines a table header cell |
TR | Defines a table row |
TT | Checks |
U | Checks |
UL | An unordered list, Checks, type parameter used |
VAR | See Pre |
WBR | Ignores |
!DOCTYPE | Ignores |
Here, Checks means that for every opening tag a matching closing tag is sought. In most cases the order of the closing tags is not relevant, but occasionally the output will be unexpected.
Here, Ignores means that the tag is simply skipped and no output is generated.
Some tags have optional closing tags; these are ignored and not checked, e.g. <tr>, <td>, <th>, <p>. Some tags need a closing tag only in some cases (e.g. <a name=...>); then only those that need one will be checked. Note that this only describes the actions taken by HTML2TEXT and not what the HTML specification says.
Some tags need a closing tag (preceded by a slash); these will be checked if the tag was opened before. It is also checked whether those tags are closed in the right order. Furthermore, it is checked that tags are not nested unnecessarily (e.g. bold); such nesting usually indicates a missing slash in the second tag. Many tags are simply ignored and thus generate no output. Some tags optionally generate output. Any text after </html> is ignored. Some tags cause the following text to be ignored.
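The kind of checking described above can be sketched in Python with the standard html.parser module. This is not how HTML2TEXT itself is implemented; the class name, the small set of checked tags and the message texts are all chosen for illustration only.

```python
from html.parser import HTMLParser

CHECKED = {"b", "i", "blockquote", "center", "ul", "ol"}  # illustrative subset
NO_NESTING = {"b", "i"}                                   # nesting these is suspicious


class TagChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.warnings = []

    def handle_starttag(self, tag, attrs):
        if tag not in CHECKED:
            return                        # other tags are not checked here
        if tag in NO_NESTING and tag in self.open_tags:
            self.warnings.append("<%s> nested inside <%s>" % (tag, tag))
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in CHECKED:
            return
        if tag not in self.open_tags:
            self.warnings.append("</%s> without matching <%s>" % (tag, tag))
            return
        if self.open_tags[-1] != tag:
            self.warnings.append("</%s> closed out of order" % tag)
        # Remove the most recently opened occurrence of this tag.
        index = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
        del self.open_tags[index]


def check(html):
    """Return a list of warnings about unbalanced or badly ordered tags."""
    checker = TagChecker()
    checker.feed(html)
    checker.close()
    for tag in checker.open_tags:
        checker.warnings.append("<%s> never closed" % tag)
    return checker.warnings
```

For instance, check("<b>bold<b>") reports a nested <b>, which is the typical result of a missing slash in the second tag.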
Unknown tags are ignored and optionally a message is generated.
Tables generate the following output. Every table row is written on at least one line, and every row ends with a linefeed. Table columns are separated by at least one space (no boxes or anything). Table options are implemented, but currently do not work very well: a row can be affected by at most one rowspan and one colspan. Also, text is not stretched to the full length of cells with rowspans; the surrounding cells will be empty instead. Tables are squeezed to a minimum size if linewrap is chosen. Otherwise each cell will be as wide as the longest cell in its column.
For large tables, check the config-file for parameters that let those be handled well too. (Who uses tables larger than 256 × 10 with cells of more than 64 KB? However, some people build their whole pages in a table...) If you do, you will have to increase max_rows and max_cols in the config-file. If this is necessary, the most likely error messages are error 13 and error 14. Also possible are error 7 and error 12. Except for error 12, these errors are fatal. Occasionally this may also lead to a situation in which your machine seems not to respond.
Nested tables are not supported either; they will be treated as if the notables option were set to on. Only the outer table will be formatted.
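The layout without line wrapping (each column as wide as its longest cell, columns separated by one space) can be sketched in Python. The function format_table is hypothetical and ignores rowspan and colspan entirely.

```python
def format_table(rows):
    """Lay out rows of cell strings in space-separated, left-aligned columns."""
    n_cols = max((len(row) for row in rows), default=0)
    widths = [0] * n_cols
    for row in rows:
        for i, cell in enumerate(row):
            widths[i] = max(widths[i], len(cell))   # longest cell per column
    lines = []
    for row in rows:
        padded = [cell.ljust(widths[i]) for i, cell in enumerate(row)]
        lines.append(" ".join(padded).rstrip())      # one output line per table row
    return "\n".join(lines)
```

A two-row table with the cells "Name", "Size" and "HTML2TXT.EXE", "big" comes out with "Size" and "big" aligned in the same column.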
HTML2TEXT can have two kinds of output:
Note that all messages are written to standard error. This is because one needs to make a distinction between the converted text and the additional info outputted by HTML2TEXT. Thus any messages are written to the screen even if stdout is redirected. Standard error can be redirected as well btw (however command.com does not support it). Also the hush option will prevent output to stderr.
Errors
Error | Error string | Description |
---|---|---|
1 | Illegal parameter | A command line parameter was not recognised |
2 | No such file | No file was found matching the file name specification |
3 | No filename specified | No file specification was found on the command line |
4 | Config-file not found | The program could not locate the config-file, which is usually found in the current directory or in the directory containing html2txt.exe |
5 | Ini-file not found | The program could not locate the ini-file (see error 4) |
6 | Error in ini-file | One entry in the ini-file contains an illegal value |
7 | Not enough memory | There wasn't enough memory to execute the program |
8 | File couldn't be opened | One file could not be found or opened |
9 | Error in config-file | One entry in the config-file contains an illegal value |
10 | To many amp-codes in ini-file | The ini-file contains too many codes to hold in memory |
11 | File skipped | The file couldn't be converted |
12 | Heap corrupted | Memory is being corrupted during the conversion |
13 | Too many rows in table | The table contains more rows than the program can keep in memory |
14 | Too many columns in table | The table contains more columns than the program can keep in memory |
15 | Specified path is illegal | The ini-file couldn't be found in the specified path |
16 | Could not create temporary file | There isn't enough space on the disk or there aren't enough handles to open a temporary file |
17 | File writing error: Disk full | An error occurred while writing a file; most likely the disk is full |
Warning | Warning String | Description |
---|---|---|
256 | Unrecognised HTML-code | The HTML-code was not recognised, probably not defined |
257 | Ill-contructed HMTL-code | The HTML-code was different from the one expected, probably the order of codes is swapped |
258 | Illegal list item | The list item or list was of an illegal type |
259 | Semicolon expected | The semicolon after a &... sequence is missing |
260 | Illegal token | A token was encountered which wasn't legal in the context |
261 | Ill-constructed amp code | The amp code is not defined in the ini-file |
262 | Misplaced <title> | The title appeared outside of the head section |
263 | HTML-tag starts with space | The HTML tag starts with a space character |
264 | Invalid list type | The type specified for a list item or a list was illegal |
265 | Unexpected '>' encountered | A greater-than token was encountered without a matching less-than token |
266 | LI without list | An LI tag was encountered outside a list section |
267 | DD without definition list | A DD tag was encountered outside a definition list |
268 | DT without definition list | A DT tag was encountered outside a definition list |
269 | Tables within tables not supported | A table section within a table was encountered |
270 | Table cell truncated | A table cell contained more than 65K data |
There are several ways to obtain a copy of html2text:
1.20a & 1.20b |
1.10 |
1.02 |
1.01 |
1.00 |
Write email to:
wieger1@noord.bart.nl
or
schotanu@cs.utwente.nl
Write to:
Gavin Spearhead
Witbreuksweg 387-302
7522 ZA Enschede
The Netherlands
The latest version of this file can be found at www.noord.bart.nl/~wieger1/html2txt.htm