HTML parsing in Elixir with leex and yecc (eellson.com)
67 points by eellson on Jan 22, 2017 | 6 comments


Another great use of leex and yecc -- a SQL parser from folks at Basho:

https://github.com/basho/riak_ql

Specifically:

https://github.com/basho/riak_ql/blob/develop/src/riak_ql_le...

https://github.com/basho/riak_ql/blob/develop/src/riak_ql_pa...

It is a very concise and well-written piece of software.


The best library I've found for this sort of thing is Gumbo: https://github.com/google/gumbo-parser

With its help I've created scrapers and crawlers that digest even the most disgusting HTML.


Hmm... This seems more like XML parsing to me than HTML parsing - in particular, there's no handling of (completely valid) omitted end tags.

Definitely interesting though.
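
To make the omitted-end-tag point concrete, here is a rough Python sketch (not from the article, and not Elixir). It assumes only the standard library: html.parser as a lenient HTML tokenizer and xml.etree.ElementTree as a stand-in for a strict "every tag must be closed" grammar. The names SNIPPET, TagLogger, html_events and parses_as_xml are made up for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Valid HTML: the </li> end tags are optional and omitted here.
SNIPPET = "<ul><li>one<li>two</ul>"


class TagLogger(HTMLParser):
    """Records start/end tags exactly as the tokenizer sees them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def html_events(text):
    """Tokenize text as HTML and return the list of tag events."""
    parser = TagLogger()
    parser.feed(text)
    parser.close()
    return parser.events


def parses_as_xml(text):
    """True if a strict XML parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(html_events(SNIPPET))
    print(parses_as_xml(SNIPPET))
```

The HTML tokenizer reports two li start tags and no li end tag without complaint, while the XML parser rejects the same input as mismatched. Note that html.parser only tokenizes; it does not infer where the implied end tags belong, which is the part a real HTML5 parser has to handle.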


The article mentions Floki, which incidentally just added support for the servo/html5ever parser written in Rust.

https://github.com/hansihe/ex_html5ever

Excellent article about creating parsers, though, even if HTML parsing is a particularly difficult problem.


Floki is a great lib; I used it to write a very basic URL polling CLI tool in just 72 lines of code: https://github.com/vikeri/proba/blob/master/lib/proba.ex


As the author says, this is a toy project to learn Elixir; don't use in production, especially not on dynamic/user content.



