$Header: /CVSROOT/tinohtmlparse/README,v 1.7 2009-01-28 18:27:29 tino Exp $
This Works is placed under the terms of the Copyright Less License,
see file COPYRIGHT.CLL. USE AT OWN RISK, ABSOLUTELY NO WARRANTY.
However, as ekhtml is linked in statically, binaries must
comply with the ekhtml license.
Origin: http://www.scylla-charybdis.com/tool.php/tinohtmlparse
Compile:
========
This is currently based on ekhtml, a deadly imperfect HTML parser (for
example, it does not parse comments correctly). Perhaps someday I will
get around to writing a working version (which shall then be able to
sanitize HTML as well), but for now, we keep it as it is.
First fetch ekhtml from CVS at sourceforge.net:
cvs -d :pserver:anonymous@ekhtml.cvs.sourceforge.net:/cvsroot/ekhtml login
(empty password)
cvs -z3 -d :pserver:anonymous@ekhtml.cvs.sourceforge.net:/cvsroot/ekhtml co ekhtml
Compile ekhtml:
cd ekhtml; ./autogen.sh; make
To compile tinohtmlparse:
Create a softlink to the ekhtml source if you compiled it in another directory:
ln -s ../somewhere/ekhtml ekhtml
Then type
make
You can ignore the error that tinolib is missing as tinolib is not
required for this. If you really want it, grab the distribution of
tinolib and let the softlink point there like this:
ln -s ../somewhere/tinolib-*/old tinolib
Note that tinolib restricts distributions which bundle it to GPL!
Usage:
======
Usage:
tinohtmlparse [-r|--raw] [-l|--list] [-o|--old]
--list shows a list of known entities
--raw does not convert these entities to their % representation in
attributes.
--old uses the old output variant. The new variant separates the
last argument from the previous one with a TAB instead of SPC.
The HTML file is read from stdin and the output is written to stdout.
The parsed lines all follow this template:
TYPE TAG ATTR Q TEXT
The words are separated by SPC or TAB. The first 4 words are
guaranteed never to contain SPC or TAB. If one of them is empty, it is
guaranteed that no more words or text follow.
- TYPE is a type string (see below)
- TAG usually is the HTML TAG (ekhtml converts this to uppercase)
- ATTR is the attribute name
- Q is a Quote type of the text which follows.
- TEXT is the text and is % escaped such that it is URL compatible
When TYPE is "text" or "comment" then TAG is a number counting the
lines starting with 0, ATTR is an LF flag and Q always is -.
Q can be B for boolean attribs (those without =), N (was not quoted),
' or " (the quote which was used). There is a form where Q is two
HEX digits HH, but this should never show up (it is there in case
ekhtml sends some unusual quote character).
So you can do
./tinohtmlparse < htmlfile |
./tinohtmlabsurl.sh "BASEURL" |
while read -r type tag name q text
do
...
done
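As an illustration, the loop above could be filled in to print the
(still %-escaped) values of all href attributes of A tags. This is only
a sketch: the name list_links is made up here, and because this README
only states that TAG is converted to uppercase, the match ignores case
for both TAG and ATTR as a precaution.

```sh
#!/bin/sh
# list_links (hypothetical helper, not part of tinohtmlparse):
# reads tinohtmlparse output on stdin and prints the TEXT field of
# every "attr A href" line, one per line, still %-escaped.
# The body is a subshell "( ... )" so the loop variables stay local.
list_links() (
    # The default IFS splits on SPC and TAB, so this copes with both
    # the old (SPC) and the new (TAB) separator in front of TEXT.
    while read -r type tag name q text; do
        # Assumption: match TAG and ATTR case-insensitively, in case
        # ATTR is not normalized the same way as TAG.
        case "$type $tag $name" in
        "attr "[Aa]" "[Hh][Rr][Ee][Ff])
            printf '%s\n' "$text" ;;
        esac
    done
)
```

It would be used as "./tinohtmlparse < htmlfile | list_links".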
Output documentation:
=====================
open TAG
close TAG
Opening and closing tags encountered. TEXT is empty.
attr TAG ATTR Q TEXT
A named attribute, immediately follows "open".
TAG is the TAG it belongs to, added for easier parsing.
ATTR is the attribute's name.
The text is URL-escaped with %, that is %xx is the hex
representation of any unusual character (including SPC).
For unicode characters there is the representation %uXXXX.
If HTML entities (like &amp;) are encountered, they are
automatically changed into the character representation
(unless --raw is given).
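To get the raw attribute value back, the %xx escapes have to be undone.
The following is a minimal sketch in POSIX sh, not something shipped
with tinohtmlparse: the name pct_decode is made up, and %uXXXX escapes
are deliberately passed through unchanged, since the right byte
sequence for them depends on the encoding you want.

```sh
#!/bin/sh
# pct_decode (hypothetical helper): print its argument with every %xx
# escape replaced by the byte it stands for, followed by a newline.
# Assumption: %uXXXX escapes are copied as they are.
# The body is a subshell "( ... )" so the work variables stay local.
pct_decode() (
    s=$1
    while [ -n "$s" ]; do
        case $s in
        %u[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f]*)
            # %uXXXX: copy all 6 characters unchanged
            rest=${s#??????}
            printf '%s' "${s%"$rest"}" ;;
        %[0-9A-Fa-f][0-9A-Fa-f]*)
            # %xx: cut out the two hex digits
            rest=${s#???}
            hex=${s%"$rest"}
            hex=${hex#?}
            # printf has no portable \xHH escape, so convert the value
            # to octal and let %b expand the \0ddd form.
            printf '%b' "\\0$(printf '%03o' "0x$hex")" ;;
        *)
            # ordinary character (or a stray %): copy it
            rest=${s#?}
            printf '%s' "${s%"$rest"}" ;;
        esac
        s=$rest
    done
    printf '\n'
)
```

For example, pct_decode 'a%20b' prints "a b".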
text COUNT LF - TEXT
comment COUNT LF - TEXT
COUNT is the line count.
LF is either 0 (TEXT does not contain an LF) or 1 (TEXT does
contain an LF). Multiple lines are repeated with the line
count counted up, so there are no hard-to-parse
continuation lines.
In case of the comment form, this is the commented out text.
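Because every "text" line is self-contained, gluing the text back
together needs no state. The sketch below is an illustration only: the
name join_text is made up, and it assumes that LF=1 marks a chunk that
was terminated by a linefeed in the source (this README does not spell
out where the LF sits). TEXT is left %-escaped.

```sh
#!/bin/sh
# join_text (hypothetical helper): reads tinohtmlparse output on stdin
# and prints the TEXT of all "text" lines, still %-escaped.
# Assumption: LF=1 means the chunk ended with a linefeed, so a newline
# is printed after it; LF=0 chunks are printed without one.
# The body is a subshell "( ... )" so the loop variables stay local.
join_text() (
    while read -r type count lf q text; do
        [ "$type" = text ] || continue
        if [ "$lf" = 1 ]; then
            printf '%s\n' "$text"
        else
            printf '%s' "$text"
        fi
    done
)
```

Changing the test to "comment" would collect the commented out text
instead.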
Notes:
======
People out there sometimes write deadly HTML code. tinohtmlparse is
too perfect to handle this - all heuristics are missing.
For example, I have already seen a URL attribute that was spread over
TWO lines: a leading SPC, then "pic", then a line break before ".jpg".
Firefox handles this correctly and loads pic.jpg! So it silently
trims SPCs from URLs and removes CRs and LFs from within, too.
tinohtmlparse does not do this. It transforms the URL into
%20pic%0a.jpg, which then probably is not found on the server.
If you need to parse such crap, apply your own heuristics.
tinohtmlparse will never do this for you.
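One such heuristic, applied to the escaped TEXT field, might look like
the following sketch. It only mimics the behaviour described above
(drop CR and LF, trim leading and trailing SPC) and makes no claim to
match what any browser really does. The name sanitize_url is made up.

```sh
#!/bin/sh
# sanitize_url (hypothetical helper): clean up a %-escaped URL as
# printed by tinohtmlparse.
#   1. remove every escaped LF (%0a) and CR (%0d), in either case
#   2. strip leading escaped SPCs (%20)
#   3. strip trailing escaped SPCs (%20)
sanitize_url() {
    printf '%s\n' "$1" |
        sed -e 's/%0[aAdD]//g' \
            -e 's/^\(%20\)*//' \
            -e 's/\(%20\)*$//'
}
```

Inside a read loop this could be used as text=$(sanitize_url "$text"),
which turns %20pic%0a.jpg into pic.jpg.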
Please also note that before 0.1.4 the SPC was not escaped to
%20, it was output unchanged. This is fixed now.
-Tino
webmaster@scylla-charybdis.com
$Log: README,v $
Revision 1.7 2009-01-28 18:27:29 tino
TAB for separator for last argument changed
Revision 1.6 2007-12-30 17:57:03 tino
Placed under the CLL, also one entity code was fixed (&and;)
Revision 1.5 2007-12-30 17:15:45 tino
SF URL
Revision 1.4 2007-09-16 06:19:05 tino
Output documentation fixed and notes added
Revision 1.3 2007/09/16 06:10:17 tino
corrected
Revision 1.2 2006/06/11 06:57:30 tino
Mainly only documentation corrected
Revision 1.1 2005/02/05 23:07:28 tino
first commit, tinohtmlparse.c is missing "text" aggregation