
Another way to parse Markdown, HTML, or docx files would be pandoc [1]:

  pandoc --to json file.docx
or in Python:

  import json
  from sh import pandoc
  # str() works whether sh returns a RunningCommand (1.x) or a plain string (2.x)
  doc = json.loads(str(pandoc("file.docx", to="json")))
Example output (reformatted slightly to reduce the number of lines):

  {'pandoc-api-version': [1, 22, 2],
   'meta': {'title': {'t': 'MetaInlines',
     'c': [{'t': 'Str', 'c': 'The'}, {'t': 'Space'}, {'t': 'Str', 'c': 'Title'}]}},
   'blocks': [{'t': 'Header',
     'c': [1,
      ['first-chapter', [], []],
      [{'t': 'Str', 'c': 'First'}, {'t': 'Space'}, {'t': 'Str', 'c': 'Chapter'}]]},
    {'t': 'Para',
     'c': [{'t': 'Str', 'c': 'I'}, {'t': 'Space'}, {'t': 'Str', 'c': 'like'}, {'t': 'Space'},
      {'t': 'Emph', 'c': [{'t': 'Str', 'c': 'cursive'}]}, {'t': 'Space'}, {'t': 'Str', 'c': 'or'}, 
      {'t': 'Space'}, {'t': 'Strong', 'c': [{'t': 'Str', 'c': 'bold'}]}, {'t': 'Space'},
      {'t': 'Str', 'c': 'text.'}]},
    {'t': 'Para',
     'c': [{'t': 'Str', 'c': 'Here'}, {'t': 'Space'}, {'t': 'Str', 'c': 'is'}, {'t': 'Space'},
      {'t': 'Str', 'c': 'a'}, {'t': 'Space'}, {'t': 'Link',
       'c': [['', [], []], [{'t': 'Str', 'c': 'link'}], ['https://ix.de/', '']]},
      {'t': 'Str', 'c': '.'}]},
    {'t': 'BulletList',
     'c': [[{'t': 'Para', 'c': [{'t': 'Str', 'c': 'Item'}, {'t': 'Space'}, {'t': 'Str', 'c': '1'}]}],
      [{'t': 'Para', 'c': [{'t': 'Str', 'c': 'Item'}, {'t': 'Space'}, {'t': 'Str', 'c': '2'}]}]]}]}
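To give an idea of how such a tree can be consumed, here is a minimal sketch that flattens it back into plain text lines. The helper names inline_text and block_lines and the sample dict are made up for illustration. The sketch only handles the node types that appear in the output above, and it assumes the node layout shown there (pandoc-api-version 1.22.x), which may differ in other versions:

```python
def inline_text(inlines):
    """Concatenate the plain text of a list of pandoc inline nodes."""
    parts = []
    for node in inlines:
        t, c = node["t"], node.get("c")
        if t == "Str":
            parts.append(c)
        elif t in ("Space", "SoftBreak"):
            parts.append(" ")
        elif t in ("Emph", "Strong"):
            parts.append(inline_text(c))     # c is a list of inlines
        elif t == "Link":
            parts.append(inline_text(c[1]))  # c = [attr, inlines, target]
        # any other inline type is skipped in this sketch
    return "".join(parts)


def block_lines(blocks):
    """Yield one text line per header, paragraph, or bullet item paragraph."""
    for block in blocks:
        t, c = block["t"], block.get("c")
        if t == "Header":
            level, _attr, inlines = c        # c = [level, attr, inlines]
            yield "#" * level + " " + inline_text(inlines)
        elif t == "Para":
            yield inline_text(c)
        elif t == "BulletList":
            for item in c:                   # each item is a list of blocks
                for line in block_lines(item):
                    yield "- " + line
        # any other block type is skipped in this sketch


# Hypothetical sample, cut down from the example output above.
sample = {
    "blocks": [
        {"t": "Header", "c": [1, ["first-chapter", [], []],
            [{"t": "Str", "c": "First"}, {"t": "Space"}, {"t": "Str", "c": "Chapter"}]]},
        {"t": "Para", "c": [
            {"t": "Str", "c": "Here"}, {"t": "Space"}, {"t": "Str", "c": "is"},
            {"t": "Space"}, {"t": "Str", "c": "a"}, {"t": "Space"},
            {"t": "Link", "c": [["", [], []], [{"t": "Str", "c": "link"}],
                                ["https://ix.de/", ""]]},
            {"t": "Str", "c": "."}]},
        {"t": "BulletList", "c": [
            [{"t": "Para", "c": [{"t": "Str", "c": "Item"}, {"t": "Space"},
                                 {"t": "Str", "c": "1"}]}]]},
    ]
}

if __name__ == "__main__":
    print("\n".join(block_lines(sample["blocks"])))
```

With the doc loaded earlier, the call would be block_lines(doc["blocks"]). Link targets and emphasis markers are dropped on purpose here, since the point is only to show the recursive shape of the tree.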
[1] https://pandoc.org/


