html2text is a Python script that converts a page of HTML into clean, easy-to-read plain ASCII text. Better yet, that ASCII also happens to be valid Markdown (a text-to-HTML format).
Usage: html2text [(filename|url) [encoding]]
Option | Description |
---|---|
`--version` | Show program's version number and exit |
`-h`, `--help` | Show this help message and exit |
`--ignore-links` | Don't include any formatting for links |
`--protect-links` | Protect links from line breaks by surrounding them with angle brackets |
`--ignore-images` | Don't include any formatting for images |
`--images-to-alt` | Discard image data, only keep alt text |
`-g`, `--google-doc` | Convert an html-exported Google Document |
`-d`, `--dash-unordered-list` | Use a dash rather than a star for unordered list items |
`-b BODY_WIDTH`, `--body-width=BODY_WIDTH` | Number of characters per output line, 0 for no wrap |
`-i LIST_INDENT`, `--google-list-indent=LIST_INDENT` | Number of pixels Google indents nested lists |
`-s`, `--hide-strikethrough` | Hide strike-through text. Only relevant when `-g` is specified as well |
`--escape-all` | Escape all special characters. Output is less readable, but avoids corner case formatting issues. |
`--bypass-tables` | Format tables in HTML rather than Markdown syntax. |
`--single-line-break` | Use a single line break after a block element rather than two. |
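As a rough illustration of how the usage line and the options combine, the sketch below assembles a command line for the script from Python. The helper name `build_html2text_command` and the file name `page.html` are hypothetical, and only two of the options listed above are wired in; treat it as a starting point rather than a complete wrapper.

```python
def build_html2text_command(source, encoding=None, ignore_links=False, body_width=None):
    """Build an argv list for the html2text command-line script.

    `source` is a filename or URL; `encoding` is the optional second
    positional argument from the usage line.
    """
    cmd = ["html2text"]
    if ignore_links:
        cmd.append("--ignore-links")
    if body_width is not None:
        # 0 means "do not wrap", per the option table
        cmd.append("--body-width=%d" % body_width)
    cmd.append(source)
    if encoding is not None:
        cmd.append(encoding)
    return cmd


cmd = build_html2text_command("page.html", encoding="utf-8",
                              ignore_links=True, body_width=0)
print(" ".join(cmd))
# html2text --ignore-links --body-width=0 page.html utf-8
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` if the `html2text` script is on your `PATH`.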
Or you can use it from within Python:
>>> import html2text
>>>
>>> print(html2text.html2text("<p><strong>Zed's</strong> dead baby, <em>Zed's</em> dead.</p>"))
**Zed's** dead baby, _Zed's_ dead.
Or with some configuration options:
>>> import html2text
>>>
>>> h = html2text.HTML2Text()
>>> # Ignore converting links from HTML
>>> h.ignore_links = True
>>> print(h.handle("<p>Hello, <a href='http://earth.google.com/'>world</a>!"))
Hello, world!
>>> # Don't ignore links anymore, I like links
>>> h.ignore_links = False
>>> print(h.handle("<p>Hello, <a href='http://earth.google.com/'>world</a>!"))
Hello, [world](http://earth.google.com/)!
Originally written by Aaron Swartz. This code is distributed under the GPLv3.
html2text is available on PyPI: https://pypi.python.org/pypi/html2text
$ pip install html2text
To run the unit tests with coverage:
PYTHONPATH=$PYTHONPATH:. coverage run --source=html2text setup.py test -v