It can show actual videos using the Sixel protocol. You already have terminals that support it too. Just run `xterm -ti 340`. In that terminal, run an application such as Gnuplot:
Love Joplin! I love it so much, my lazy ass actually donated. I spent quite some time searching for open source alternatives that don’t have an ulterior motive. Currently using nextcloud sync and it works. Sometimes the iOS app and the Linux desktop app are out of sync, but a sync fixes that. Would love to see mTLS implemented at some point!
As is customary for all of Apache, I have no clue what I’m looking at after trying to read through the links in that page for ten minutes. Like who is this tool for? When should I use this vs any other competing tools? No clue. I suppose it can read documents of any type and give it out as a dictionary? Why would I use this vs pandas?
I can't speak to the Apache documentation, but I once had the task of extracting plain text from many different document formats: Word, spreadsheets, PDFs, the EXIF information in JPEGs, and so on for a long list. I had written code with calls to extractor libraries for several of these formats when I came across Tika. Out went my if..then..elif..elif..elif.. code, to be replaced with a single (Python) call to Tika.
I can't answer your question about pandas, though.
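For anyone wondering what that "single call" might look like, here is a minimal sketch assuming the tika-python package (`pip install tika`), whose `parser.from_file` returns a dict with `"content"` and `"metadata"` keys. The wrapper name `extract_text` and its `parse` parameter are my own, added only so the function can be tried without a Tika server. As far as I know, tika-python needs Java and fetches and starts a Tika server on first use.

```python
def extract_text(path, parse=None):
    """Return the plain text of a document, whatever its format.

    `parse` is a hypothetical hook for this sketch: any callable that takes
    a path and returns a dict shaped like tika-python's result.
    """
    if parse is None:
        # Imported lazily so the sketch can be loaded without tika installed.
        from tika import parser
        parse = parser.from_file

    parsed = parse(path)
    # "content" can be None when no text was found (e.g. some images).
    content = parsed.get("content") or ""
    return content.strip()
```

The same call is used for a .docx, a .pdf, or a .jpg; picking the right extractor is left to Tika rather than to an if/elif chain on the file extension.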
I second this, there is absolutely no easily discoverable entry point to the documentation.
In the end, if you want to get a feeling for what this is, you search for "tika tutorial" and get a rough idea via (in my case) some Medium article, I guess.
I second this suggestion. I tested numerous Python tools to extract text - nothing matches Tika for general extraction of just about any data format.
However - if you can expect a certain format beforehand - then Python is better since you can extract higher-quality data (tables, lists) with the appropriate tool.
I've had good luck with python-docx for reading Word documents (typically specifications). Tables are supported, but it's not obvious where a table sits in the document, and I had to come up with a hacky way to read image captions.
PDF has been hit or miss, but pypdf has improved in the last couple of years. Depending on the document you'll sometimes get random spaces or nospacesatall.
I tried python-docx with a bunch of docx files (downloaded from Google Docs). It returns empty strings for hyperlinks and I couldn't manage to fix this. So if there is a sentence like "This is an important link to another doc or url." and the "link" is a hyperlink, python-docx returns "This is an important to another doc or url."
Heh, I got a bit into hacking on python-docx last year (the original author seems to be focusing on other things than python-docx now) - I have a fork/branch where I tried to more properly implement external hyperlink functionality (https://github.com/icegreentea/python-docx/pull/7)
I realize now, staring at this, that I might have broken the API a little. You can't do "text = paragraph.text" anymore, but you can do "text = ''.join([run.text for run in paragraph.runs])" instead.
If you're curious at all why it breaks, it's because in the OOXML spec paragraphs are made up of an ordered list of runs or hyperlinks (and hyperlinks can then contain additional runs). The master branch just implements paragraphs as an ordered list of runs (and ignores all hyperlinks).
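To make the runs-versus-hyperlinks point concrete, here is a small standard-library sketch that works on the raw paragraph XML rather than through python-docx. The sample paragraph and the function names `runs_only_text` and `full_text` are made up for illustration; only the `w:p`, `w:r`, `w:t` and `w:hyperlink` element names come from WordprocessingML.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hand-written sample paragraph: run, hyperlink (containing a run), run.
SAMPLE = f"""<w:p xmlns:w="{W}">
  <w:r><w:t xml:space="preserve">This is an important </w:t></w:r>
  <w:hyperlink><w:r><w:t>link</w:t></w:r></w:hyperlink>
  <w:r><w:t xml:space="preserve"> to another doc or url.</w:t></w:r>
</w:p>"""


def runs_only_text(paragraph):
    """Join the text of runs that are direct children of the paragraph."""
    return "".join(
        t.text or ""
        for run in paragraph.findall("w:r", NS)
        for t in run.findall("w:t", NS)
    )


def full_text(paragraph):
    """Walk the children in order, descending into hyperlinks as well."""
    parts = []
    for child in paragraph:
        if child.tag == f"{{{W}}}r":
            runs = [child]
        elif child.tag == f"{{{W}}}hyperlink":
            runs = child.findall("w:r", NS)
        else:
            continue  # paragraph properties, bookmarks, etc.
        for run in runs:
            parts.extend(t.text or "" for t in run.findall("w:t", NS))
    return "".join(parts)


paragraph = ET.fromstring(SAMPLE)
print(repr(runs_only_text(paragraph)))  # the word "link" is missing
print(repr(full_text(paragraph)))       # hyperlink text is included
```

Looking only at direct-child runs drops the hyperlink text, which matches the missing "link" described above; walking runs and hyperlinks in document order keeps it.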
This sounds amazing! Thanks for sharing it. I will try it to see if it can replace the main python-docx for me. For my use case it suffices to have the full text of each paragraph (even if it includes a hyperlink) and each heading, but also to be able to get them separately when needed.
I found myself today trying to parse a TSV and substituting a few fields with a different value, then writing the new file out.
Something that Perl would excel at, although I used Python, because Perl isn't as maintainable as Python.
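For reference, that kind of TSV rewrite can be sketched with Python's standard `csv` module. The function name `substitute_fields`, the replacement mapping and the sample data are all made up here, and the sketch assumes fields contain no embedded tabs, newlines or quote characters.

```python
import csv
import io


def substitute_fields(src, dst, replacements):
    """Copy TSV rows from src to dst, swapping any field found in replacements."""
    reader = csv.reader(src, dialect="excel-tab")
    writer = csv.writer(dst, dialect="excel-tab", lineterminator="\n")
    for row in reader:
        writer.writerow([replacements.get(field, field) for field in row])


# In-memory example; real files would be opened with newline="".
source = io.StringIO("id\tstatus\n1\tN/A\n2\tok\n")
target = io.StringIO()
substitute_fields(source, target, {"N/A": "unknown"})
print(target.getvalue())
```

With real files, `open(path, newline="")` on both sides lets the csv module handle line endings itself.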
I was intrigued by this comment. A JVM solution would also be viable in my tech stack. Would Tika be easier than line-by-line processing with compiled regexes in Python? I tried looking at the usage examples, but it wasn't clear.