
Aleph: A suite of data analysis tools for investigators - salzig
https://docs.alephdata.org/
======
capableweb
The GitHub readme/repository doesn't give a fair overview of what this
project really covers. Seems really ambitious and well made, at least from a
quick glance. This page gives a better overview:
[https://docs.alephdata.org/how-aleph-is-used](https://docs.alephdata.org/how-aleph-is-used)

Some problems they aim to solve:

> Easy data search for both structured and unstructured information (ie.
> documents and databases).

> Cross-referencing between different datasets ("Who are all the politicians
> in my country that are mentioned in this leak?")

> Access control and data compartmentalisation, but also flexible sharing
> within cross-border teams.

> Continuous crawling of hundreds of public data sources as background
> material for research.

> Visual exploration of investigative analysis.

~~~
dang
Ok, we've changed the URL from
[https://github.com/alephdata/aleph](https://github.com/alephdata/aleph) to
the project home page.

~~~
salzig
Thanks

------
divbzero
I trialed Aleph recently and was impressed by its progress against an
ambitious goal. My impressions as a user were as follows:

1\. Aleph is excellent out-of-the-box for its

– OCR, via Tesseract or Google’s Vision API

– Full text search, via Elasticsearch

– Browser based UI, via React

2\. Aleph does an okay job but has room for improvement with

– Entity extraction

– Language detection

where “okay” means it’s accurate enough to be useful for filtering by names,
emails, languages, _etc._, but you’ll probably encounter occasional errors.

I also noticed search latency in my deployment and would love to try the
Elasticsearch tips from the HN thread last week [1]. This latency does not
appear in the production deployment by the Aleph team.

[1]:
[https://news.ycombinator.com/item?id=22396918](https://news.ycombinator.com/item?id=22396918)

Again, props to the Aleph team for their success so far.

~~~
bryanrasmussen
Given your trial, do you think it would be useful to investigate moving a
search solution for standards and regulations in PDF and text documents to
Aleph? That is to say, is it a good enough search solution for structured and
unstructured data that it would make sense to build on top of it instead of
rolling your own?

~~~
divbzero
Yes, definitely worth investigating if you need full text search as well as
content extraction from PDFs. I found the production deployment installation
[1] to be the most straightforward.

[1]:
[https://docs.alephdata.org/developers/installation#productio...](https://docs.alephdata.org/developers/installation#production-deployment)

Feel free to ping me if you decide to try it and have questions. In addition,
the Aleph team is active on both GitHub and Slack.

------
ssutch3
A LOL from the docs:

> Can I run Aleph without using Docker?

> Can Britain leave the European Union? Yes, it's possible; but complicated
> and will probably not make your life better in the way that you're
> expecting.

------
salzig
As a side note, I stumbled on this because German public television seems to
be working on it too. I found that quite interesting to see, in addition to
finding the project itself:

[https://github.com/NorddeutscherRundfunk/aleph](https://github.com/NorddeutscherRundfunk/aleph)

------
adultSwim
[https://www.icij.org/blog/2016/04/data-tech-team-icij/](https://www.icij.org/blog/2016/04/data-tech-team-icij/)

ICIJ put together a great platform to investigate the Panama and Paradise
Papers

------
Jugurtha
Note on the name: In addition to the origin story, Aleph is also the first
letter in Arabic and Hebrew (א, ا)

~~~
david-cako
Which represents the “breath of life”, and all possible things it can become.

~~~
Jugurtha
The ultimate "Initial commit". Care to share some reading on this that you
found interesting/intriguing/enlightening?

~~~
david-cako
Kabbalah and the associated mysteries; BOTA has great coursework on the tarot
that teaches to the Hebrew roots.

To me the tetragrammaton is a logical proof that approximates an “initial
commit”.

~~~
capableweb
I'm an outsider that doesn't have much knowledge about faith, please humor me.

I looked up "Tetragrammaton" since I had never heard of it[0], but I don't see
how it would constitute a logical proof, and I don't quite understand the
significance. How do you reach your approximation from the tetragrammaton?

\- [0]
[https://en.wikipedia.org/wiki/Tetragrammaton](https://en.wikipedia.org/wiki/Tetragrammaton)

~~~
david-cako
The tetragrammaton is a 4-letter name of God, and a conceptualization of

\- _somethingness_

\- becoming observable/“contrasty”

\- germinating

\- and becoming self-aware (and an applied practice)

Like a stack overflow of reality. It is one of many names of God in Hebrew.

This essentially represents “the all”, as we can understand it. It’s
interesting how ancient alphabets were built on _base_ concepts and
interactions. They have a sort of philosophy to them.

Markov chain text generation is pretty wild, because we now can build computer
systems that endlessly generate their own many-letter “names of God”. God has
many names in Hebrew; everything is a name of God, so to speak, like a DNA
sequence.
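
For anyone who hasn't seen Markov chain text generation before, here is a minimal character-level sketch in Python. Everything in it (the function names `build_chain` and `generate`, the `order` and `seed` parameters, the toy corpus) is made up for illustration and not taken from any particular library. It only records which character follows each short run of characters, then walks those records at random.

```python
import random
from collections import defaultdict


def build_chain(text, order=2):
    """Map each `order`-character state to the characters seen right after it."""
    chain = defaultdict(list)
    for i in range(len(text) - order):
        chain[text[i:i + order]].append(text[i + order])
    return dict(chain)


def generate(chain, length, seed=None):
    """Walk the chain and return `length` characters.

    When the current state has no recorded follower (a dead end),
    the walk restarts from a randomly chosen known state.
    """
    rng = random.Random(seed)
    states = sorted(chain)  # sorted so a fixed seed gives a repeatable walk
    state = rng.choice(states)
    out = list(state)
    while len(out) < length:
        followers = chain.get(state)
        if not followers:
            state = rng.choice(states)
            out.extend(state)
            continue
        nxt = rng.choice(followers)
        out.append(nxt)
        state = state[1:] + nxt  # slide the window forward by one character
    return "".join(out)[:length]


if __name__ == "__main__":
    corpus = "aleph alef alpha"  # toy corpus, purely illustrative
    print(generate(build_chain(corpus, order=2), length=40, seed=1))
```

A larger `order` makes the output look more like the corpus; a smaller one makes it more random. The generator can only ever emit characters that appear in the corpus.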

------
DyslexicAtheist
I've been using this for some time to find info on companies, CEOs, and other
characters that appear in my news feeds.

Here is a working example:
[https://aleph.occrp.org/](https://aleph.occrp.org/)

It also has a great client API which allows you to index a large volume of
PDFs all at once:


    $> alephclient crawldir --foreign-id <id> directory_with_pdf/
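
If you want to script that, a minimal Python sketch might look like the following. It only assembles the same `alephclient crawldir --foreign-id <id> <directory>` invocation shown above and counts the PDFs it would cover; the helper names `crawldir_command` and `count_pdfs` and the example foreign ID are hypothetical, and it assumes `alephclient` is installed and already configured for your Aleph instance.

```python
from pathlib import Path


def crawldir_command(foreign_id, directory):
    """Build the argument list for the crawldir command shown above."""
    return ["alephclient", "crawldir", "--foreign-id", foreign_id, str(directory)]


def count_pdfs(directory):
    """Count PDF files under `directory`, recursively, ignoring case in the suffix."""
    return sum(
        1
        for path in Path(directory).rglob("*")
        if path.is_file() and path.suffix.lower() == ".pdf"
    )


if __name__ == "__main__":
    # "my_investigation" is a made-up foreign ID for illustration.
    cmd = crawldir_command("my_investigation", "directory_with_pdf/")
    print(" ".join(cmd))
    # To actually run it (assumes alephclient is on PATH and configured):
    #   import subprocess
    #   subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems when the directory name contains spaces.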

------
monkeydust
Looks interesting for personal or company wide search across multiple document
types.

------
traverseda
Looks like a great alternative to open semantic search.

------
OliverJones
Dear HN colleagues: let's be careful about swamping those Aleph folks with
traffic. They probably have enemies around the web that would exploit any
overload and outage. Slashdotting can definitely turn into an unintended DDoS
attack.

Better yet: maybe somebody with access to some kind of attack-resistant CDN
provider could help them migrate.

If they haven't already.

~~~
traverseda
What do you mean? The actual Aleph server is a Docker image that you deploy
on premises, so we can't overload the actual services people are using. The
infrastructure is all GitHub or Docker, which already scales.

Are we worried about DoS-ing the documentation website?
[https://xkcd.com/932/](https://xkcd.com/932/)

