
Show HN: Convert HTML to markdown with python - gaojiuli
https://github.com/gaojiuli/tomd/
======
krstf13
You should use html.parser and focus on the conversion to markdown. The way
you parse HTML in the convert function is very inefficient and can easily
produce incorrect results with valid HTML (e.g. <p class="some>stuff">some
text</p>).
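
A minimal sketch of that approach, assuming only a handful of tags need to be supported. The class name `MarkdownSketch`, the helper `html_to_markdown`, and the tag tables are hypothetical and are not taken from tomd; the point is only that the standard-library parser copes with a `>` inside a quoted attribute value.

```python
from html.parser import HTMLParser


class MarkdownSketch(HTMLParser):
    """Tiny HTML-to-Markdown converter covering only a few tags (illustrative)."""

    # Text emitted when a tag opens or closes; unknown tags emit nothing.
    OPEN = {"h1": "# ", "h2": "## ", "strong": "**", "b": "**",
            "em": "*", "i": "*", "li": "- "}
    CLOSE = {"h1": "\n\n", "h2": "\n\n", "p": "\n\n", "strong": "**",
             "b": "**", "em": "*", "i": "*", "li": "\n"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Attributes are parsed by the library, so their content is never
        # mistaken for the end of the tag.
        self.parts.append(self.OPEN.get(tag, ""))

    def handle_endtag(self, tag):
        self.parts.append(self.CLOSE.get(tag, ""))

    def handle_data(self, data):
        self.parts.append(data)


def html_to_markdown(html):
    """Feed the whole document to the parser and join the collected pieces."""
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


print(html_to_markdown('<p class="some>stuff">some <b>text</b></p>'))
```

The parser does the tokenizing, so the conversion code only has to decide what each tag maps to.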

------
williamstein
Since this is "Show HN", what is the motivation _you_ have for doing this?
There's no motivation or background in the linked README.

~~~
gaojiuli
When crawling online articles such as news, blogs, etc., I want to save them
as markdown files rather than in a database.

------
Bino
Wouldn't it make sense to escape markdown syntax that appears in the HTML? In
the HTML you're converting from, __ and * have no special meaning.

<p>__foo__</p>
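
One way this could look, as a rough sketch: backslash-escape characters that markdown may treat as syntax before writing out text nodes. The function name `escape_markdown` and the character set are assumptions chosen for illustration, not something tomd provides, and the set is deliberately not exhaustive.

```python
import re

# Characters markdown may interpret as syntax (illustrative, not exhaustive).
_MARKDOWN_SPECIALS = re.compile(r"([\\`*_{}\[\]<>#+|])")


def escape_markdown(text):
    """Prefix each special character with a backslash so it renders literally."""
    return _MARKDOWN_SPECIALS.sub(r"\\\1", text)


print(escape_markdown("__foo__"))
```

This would be applied to text content only, never to the markdown markers the converter itself emits.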

~~~
gaojiuli
Tomd can convert HTML that was itself generated from markdown. If a piece of
HTML can't be expressed in markdown, tomd can't convert it.

------
jwilk
How is it better than
[https://alir3z4.github.io/html2text/](https://alir3z4.github.io/html2text/) ?

~~~
gaojiuli
Maybe it has fewer lines of code, and the logic of tomd is easier to
understand.

------
roryisok
Nice to see projects that convert HTML to markdown. Projects that convert in
the other direction (md to HTML) are far more common. Parsing HTML is tough.

~~~
4e1a
I completely agree. I made something similar in Lua and it was a lot of fun.
I found many md to HTML converters, but not the other way around. HTML
parsing is what made me 'pause' development on that tool.

------
confounded
Pandoc may be of interest.

~~~
gaojiuli
Tomd is a Python tool. Pandoc is a Haskell tool.

~~~
cyphar
Yes. Though pandoc handles more formats and doesn't parse HTML using regex.
And it supports more features.

In general the language something is written in isn't necessarily a benefit by
itself. Is there a reason why you prefer Python when there already exists
another tool written in another language?

~~~
gaojiuli
When crawling online articles such as news, blogs, etc., I want to save them
as markdown files rather than in a database. For instance, I can download my
WordPress blog posts as markdown files with requests and tomd in a single
Python script, without configuring another language environment.

~~~
confounded
You don't need a Haskell interpreter; it's shipped as a compiled CLI by most
good package managers. There are also Python wrappers. Regardless, I'm very
glad your tool is working, it looks cool.
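
For reference, a hedged sketch of calling the compiled CLI from a Python script without any wrapper package. It assumes a `pandoc` executable is on the PATH; the helper names `pandoc_command` and `html_to_markdown_with_pandoc` are made up for this example.

```python
import shutil
import subprocess


def pandoc_command(source_format="html", target_format="markdown"):
    """Build the argv for a pandoc run that reads stdin and writes stdout."""
    return ["pandoc", "--from", source_format, "--to", target_format]


def html_to_markdown_with_pandoc(html):
    """Convert HTML through the pandoc CLI; raises if pandoc is not installed."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc executable not found on PATH")
    result = subprocess.run(
        pandoc_command(),
        input=html,           # HTML is passed on stdin
        capture_output=True,  # markdown comes back on stdout
        text=True,
        check=True,           # non-zero exit raises CalledProcessError
    )
    return result.stdout
```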

