
Xltrail – An open-source Git command line extension for Excel workbook files - bjoerns
https://www.xltrail.com/git-xltrail
======
phonon
So this does NOT create diffs for the underlying excel workbooks, just for the
VBA in the workbooks?

------
skety
I have to handle quite a lot of Excel document at work, much to my despair.
Unfortunately we are not using GIT, but SVN. I ended up creating a small tool
that extract the vba, named ranges plus the sheet in this format:
"cell_adress,type,format,value" I have to commit those files alongside the
spreadsheet, which is not great but had helped me a lot. Nice to hear others
are working on the problem!

~~~
kdmytro
Have you tried `textconv` instead? You can configure git to generate your CSVs
on the fly and show the differences between CSVs when you do `git diff`. See:
[https://www.kernel.org/pub/software/scm/git/docs/gitattribut...](https://www.kernel.org/pub/software/scm/git/docs/gitattributes.html#_performing_text_diffs_of_binary_files)

~~~
bringtheaction
Speaking of text conversion and git, I usually don't commit non-textual data
aside from necessary files like image and audio assets, but one time I
commited some PDFs in a "samples" directory for a tool I made to extract some
data from a set of PDF files, and later I removed one of them and observed
that when I typed "git show" the diff showed the text contents of the PDF
which I find rather mind blowing because of how much trouble I had experienced
extracting text and how git was casually showing me an ascii rendering of the
document more or less with good representation of the layout of the document.

This in fact prompted me to further investigate the open source text
extraction tools on the market and I ended up finding one that was better than
the one I had selected at first and which I had then been building upon.
Happily my own tools were built in such a way that I could reuse most of the
code I had written while using the previous tool, and in fact during the
rewrite I also realized that I could write the new code in a much cleaner way
and so there were basically only upsides to switching tool and rewriting some
of my code :)

~~~
voxadam
What tools did you end up settling on for PDF data/text extraction? I ask
because I have a side project that I've been neglecting for far too long which
depends in part on cleanly extracting text from PDF (other formats too but
PDFs are by far the most headache inducing).

~~~
bringtheaction
I use the pdf.load and pdf.tree.write of PDFQuery.

[https://github.com/jcushman/pdfquery](https://github.com/jcushman/pdfquery)

> PDFQuery works by loading a PDF as a pdfminer layout, converting the layout
> to an etree with lxml.etree, and then applying a pyquery wrapper. All three
> underlying libraries are exposed, so you can use any of their interfaces to
> get at the data you want.

Here is a minimal Python script to dump the XML tree so you can load it in
whatever other language you use and work with it from there.

    
    
        #!/usr/bin/env python3
    
        from pdfquery import PDFQuery
    
        pdf = PDFQuery("some_document.pdf")
        pdf.load()
        pdf.tree.write("some_document.xml", pretty_print=True, encoding="utf-8")
    

It doesn’t work _perfectly_ with _all_ documents but it works well with many.
Give it a try.

    
    
        pip3 install pdfquery

------
mdip
I have to say, pretty neat. And I partly feel like I may need a shower after
thinking that.

I have no doubt that a product like this has a _real_ need ... I used to write
software targetted at end-users for reducing first-level support at a huge
company, and it never ceased to amaze me the kinds of unbelievably important
things that Excel and Access were used to manage (on the order of processes
that the failure of which would result in 7-figure problems). It would further
amaze me to find out that many of the people working with these end-user
created solutions to problems were had some really, durably, solid rocks
between their ears, and yet were still able to navigate these tools with
_moderate_ success. Unfortunately, there's little chance that an average excel
hacker is going to have much interest or ability to operate git, and the
average git enthusiast is probably a lot like me -- a guy who goes out of his
way to _never_ launch Excel for any reason, ever.

The last time I was given a task in which the work product was demanded in
Excel, I wrote an application to create the spreadsheet programmatically ...
partly because it was the only way to do some of the "clean up" they were
asking for[0] and partly because I am a smart-ass by nature.

I realize I'm a _specific case_ and I probably have a much stronger allergy to
Excel than most, and as I said, there's few large companies that don't have
stupidly-complex processes that (also stupidly-)rely on Excel. I'm also a guy
who bends _git_ into just about every scenario I can find, so I love the idea
of anything that makes it more useful. Kudos.

[0] I had to correlate output from several web services and the source was a
spreadsheet had the values they responded with. I wrote an app to read the
rows and apply edit-distance against them to correlate a few more than were
possible using just excel tricks.

------
seren
Getting the diff is nice, now if someone has a solution to merge file, I am
all ears.

~~~
bjoerns
op here: feel free to log an issue on GH, it's definitely on our agenda:
[https://github.com/ZoomerAnalytics/git-
xltrail](https://github.com/ZoomerAnalytics/git-xltrail)

------
ninjamayo
Nice attempt, I looked into this problem too a while ago. Quick question for
the OP, I know a lot of enterprise users want to version control their
spreadsheets rather than just the code. Have you looked into that? Seems to be
a tricky problem to solve.

~~~
bjoerns
Our server-side (commercial) product actually does handle spreadsheets as well
as VBA code. If you want to see it in action, have a look at our demo
repositories: [https://demo.xltrail.com](https://demo.xltrail.com)

~~~
TAForObvReasons
The VBA component seems to use
[https://github.com/decalage2/oletools](https://github.com/decalage2/oletools)
. What are you using for handling spreadsheets?

------
gt_
Can anyone here answer whether it’s normal, when building git based
interfaces, to only use the API for queries and contextual things but utilize
regular git commands to perform interactions with the remote repositories?

I am a 3D artist building a system to version control the configuration/script
file directory trees for all the applications I use on a day to day basis. I
will be taking a look at this code for clues but maybe someone in this thread
can help me with some questions.

Does anyone know something similar to this? I am modeling it on the VSCode
Setting Sync extension, which is simple but a little rough itself.

I am using Python and just about have what I’m going for but it’s not as tight
as I want it to be. I got a good system of using one repository with a branch
per directory tree and folder symlinks to organize it locally. I can pull the
branches into individual local repos per remote branch and it works nicely.
This allows me to use regular git commands when I need to without fussing with
other application’s configuration trees.

But things get pretty loose when it comes to building the backup() and
restore() functions. For restore(), I can just doing an overwrite with fetch
then reset hard —HEAD or various things like this but it seems like there must
be a better approach considering the local repository, index etc. are a bit
unnecessary. If anyone has any thoughts, links or pointers, I’d be super
thankful! Sorry for the long post.

~~~
mschuster91
> Can anyone here answer whether it’s normal, when building git based
> interfaces, to only use the API for queries and contextual things but
> utilize regular git commands to perform interactions with the remote
> repositories?

Because there are a myriad ways to connect to remotes: local filesystem, git
protocol (rare), git+ssh, http(s), whatever-git-clone-"user@host:myrepo.git"
does, svn-bridge are the ones I know and it's well possible there are more.

So you'd need around 4-5 protocol reimplementations, plus some way to
configure the weird stuff that people do with SSH (public keys with or without
password, ssh-agent, custom ports, port knocking, bastion hosts, whatever else
one may fit in .ssh/config)... best leave this to git, especially as there's a
lot of crypto in there that you're bound to mess up.

------
erichurkman
Does this site mess with scrolling? On a macbook air, chrome, it scrolls
wildly even with a tiny nudge.

~~~
narrowtux
Yes. I got so fed up with bullshit like this that I made a blocklist for my ad
blocker just to keep sane while browsing.

[https://gist.github.com/narrowtux/19220fc077ecbaeb274c56e849...](https://gist.github.com/narrowtux/19220fc077ecbaeb274c56e8497f8e61)

The latest entry blocks the smooth scroll on the site of this post. They've
gotten smarter now and include this in bundles.

------
ericfrederich
Hmm... I literally just created an AWS bucket that magically creates (i.e.
lambda) corresponding .json files for any .xlsx file that gets put there.

I may have used Git if I knew this existed. For now our requirement was that
the file be maintained in an Excel format, but this data needs to be consumed
by different scripts written in different languages hence the automatic .json
conversion.

~~~
mulmen
I see this frequently but I question the choice of JSON which I assume is
actually newline delimited JSON objects. What is the benefit of JSON?
Familiarity?

If you need to ingest tabular data is JSON the ideal format? Why not something
like escaped TSV? I realize storage is cheap but JSON has a lot of overhead
and I don't see the benefit in using it to store data files, especially from
spreadsheets.

Depending on access patterns something like Parquet may be a lot more useful.

