$Header: /CVSROOT/tinohtmlparse/README,v 1.7 2009-01-28 18:27:29 tino Exp $

This Works is placed under the terms of the Copyright Less License,
see file COPYRIGHT.CLL.  USE AT OWN RISK, ABSOLUTELY NO WARRANTY.

However, as ekhtml is linked in statically, binaries must comply with
the ekhtml license.

Origin:
http://www.scylla-charybdis.com/tool.php/tinohtmlparse


Compile:
========

This is currently based on ekhtml, a deadly imperfect HTML parser (for
example, it does not parse comments correctly).  Perhaps some day I will
get around to writing a working version (which then shall be able to
sanitize HTML as well), but for now, we keep it as it is.

First fetch ekhtml from CVS at sourceforge.net:

    cvs -d :pserver:anonymous@ekhtml.cvs.sourceforge.net:/cvsroot/ekhtml login
    (empty password)
    cvs -z3 -d :pserver:anonymous@ekhtml.cvs.sourceforge.net:/cvsroot/ekhtml co ekhtml

Compile ekhtml:

    cd ekhtml; ./autogen.sh; make

To compile tinohtmlparse:

Create a softlink to the source of ekhtml if you compiled it in another
directory:

    ln -s ../somewhere/ekhtml ekhtml

Then type

    make

You can ignore the error that tinolib is missing, as tinolib is not
required for this.  If you really want it, grab the distribution of
tinolib and let the softlink point there like this:

    ln -s ../somewhere/tinolib-*/old tinolib

Note that tinolib restricts distributions which bundle it to GPL!


Usage:
======

Usage: tinohtmlparse [-r|--raw] [-l|--list] [-o|--old]

--list  shows a list of known entities
--raw   does not convert these entities to their % representation in
        attributes.
--old   uses the old output variant.  The new variant separates the
        last argument from the previous one with a TAB instead of SPC.

The HTML file is read from stdin and the output is written to stdout.

The parsed lines all look like the following template:

    TYPE TAG ATTR Q TEXT

The words are separated by SPC or TAB.  The first 4 words are guaranteed
to never contain SPC or TAB.
If they are empty, it is guaranteed that no more words or text follow.

- TYPE is a type string (see below)
- TAG usually is the HTML TAG (ekhtml converts this to uppercase)
- ATTR is the attribute name
- Q is the quote type of the text which follows.
- TEXT is the text and is % escaped such that it is URL compatible

When TYPE is "text" or "comment", TAG is a number counting the lines
starting with 0, ATTR is an LF flag and Q always is -.

Q can be B for boolean attribs (those without =), N (was not quoted),
' or " (the quote which was used).  There is a form where the quote is
two HEX digits HH, but this should never show up (it's in case ekhtml
sends some unusual quote character).

So you can do

    ./tinohtmlparse < htmlfile |
    ./tinohtmlabsurl.sh "BASEURL" |
    while read -r type tag name q text
    do
        ...
    done


Output documentation:
=====================

open TAG
close TAG
    Opening and closing TAG tags encountered.  TEXT is empty.

attr TAG ATTR Q TEXT
    A named attribute, immediately follows "open".  TAG is the TAG it
    belongs to, added for easier parsing.  ATTR is the attribute's
    name.  The text is URL-escaped with %, that is, %xx is the hex
    representation of any unusual character (including SPC).  For
    unicode characters there is the representation %uXXXX.  If HTML
    entities (like &amp;) are encountered, they are automatically
    changed into the character representation (unless --raw is given).

text COUNT LF - TEXT
comment COUNT LF - TEXT
    COUNT is the line count.  LF is either 0 (TEXT does not contain an
    LF) or 1 (TEXT does contain an LF).  Multiple lines are repeated
    with the line count counted up, so there are no continuation lines
    that are complex to parse.  In case of the comment form, this is
    the commented-out text.


Notes:
======

People out there partly write deadly HTML code.  tinohtmlparse is too
perfect to handle this - all heuristics are missing.

For example, I have already seen an attribute value that starts with a
SPC and is spread over TWO lines, with a line break in the middle of
the URL.

Firefox handles this correctly and loads pic.jpg!
So it silently trims SPCs from URLs and removes CRs and LFs from within,
too.  tinohtmlparse does not do this.  It transforms the URL into
%20pic%0a.jpg, which then probably is not found on the server.

If you need to parse such crap, apply your own heuristics.
tinohtmlparse will never do this for you.

Please also note that before 0.1.4 the SPC was not escaped to %20, it
was output unchanged.  This is fixed now.

-Tino
webmaster@scylla-charybdis.com

$Log: README,v $
Revision 1.7  2009-01-28 18:27:29  tino
TAB for separator for last argument changed

Revision 1.6  2007-12-30 17:57:03  tino
Placed under the CLL, also one entity code was fixed (∧)

Revision 1.5  2007-12-30 17:15:45  tino
SF URL

Revision 1.4  2007-09-16 06:19:05  tino
Output documentation fixed and notes added

Revision 1.3  2007/09/16 06:10:17  tino
corrected

Revision 1.2  2006/06/11 06:57:30  tino
Mainly only documentation corrected

Revision 1.1  2005/02/05 23:07:28  tino
first commit, tinohtmlparse.c is missing "text" aggregation
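

Example:
========

The read loop shown under Usage can be filled in as follows.  This is
only a minimal sketch, not part of the distribution: list_links is a
hypothetical helper name, and as the case of ATTR is not documented
above, the sketch assumes it may arrive in upper or lower case.  It
prints the TEXT of every href attribute of an A tag, still % escaped.

```shell
# list_links (hypothetical helper, not part of tinohtmlparse):
# reads tinohtmlparse output on stdin and prints the TEXT word of
# every href attribute that belongs to an A tag.  TEXT is printed
# as is, that is, still % escaped.
list_links()
{
  # default IFS splits on SPC and TAB, so both the old and the new
  # output variant are read the same way
  while read -r type tag name q text
  do
    # only "attr" lines carry attributes
    [ "$type" = attr ] || continue
    # ekhtml converts the TAG to uppercase
    [ "$tag" = A ] || continue
    # assumption: the case of ATTR is not documented, so accept any
    case "$name" in
      [Hh][Rr][Ee][Ff]) printf '%s\n' "$text" ;;
    esac
  done
}
```

It would be used like this:

    ./tinohtmlparse < htmlfile | list_links

Boolean attributes (Q is B) have no TEXT, so for them an empty line
would be printed.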