Prolog for data science (emiruz.com)
190 points by usgroup on May 7, 2023 | 55 comments



Logic programming offers a good foundation for anything that people call "rule engines." Within logic programming, there is some variation in the degree of declarativeness.

Datalog is arguably the minimal core of logic programming, similar to what the lambda calculus achieves for functional programming. Unfortunately, it has been forgotten outside of the database and query-processing realm. A resurgence has happened in recent years, as PL researchers and industry alike have discovered the virtues of datalog (e.g. Flix, Datafun). My own attempt at making this more widely known is here: https://github.com/google/mangle, a language from the datalog family and its implementation as a Go library.
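
For readers who haven't seen it, here is a minimal sketch of what plain datalog looks like (a generic example, not taken from Mangle; the same syntax also loads in most Prolog systems): facts plus a recursive rule defining the transitive closure of a relation.

    % Facts: the base "edge" relation.
    edge(a, b).
    edge(b, c).
    edge(c, d).

    % A recursive rule: reachable/2 is the transitive closure of edge/2.
    reachable(X, Y) :- edge(X, Y).
    reachable(X, Y) :- edge(X, Z), reachable(Z, Y).

A datalog engine derives all reachable/2 facts bottom-up and is guaranteed to terminate.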

As the example shows, plain "rules" (or: plain datalog) are rarely enough to capture everything that one wants to express. The question then is how to combine a purely declarative "kernel" with more general-purpose programming (e.g. mapping over a list).

PROLOG offered one answer, already in the 1980s, but I fully reject it: the fact that writing a program in the wrong order with negation and recursion makes it non-terminating is not something we'd want everyone to deal with. Datalog with stratified recursion is somewhat better, as "layers of rules" is a concept that is easy to understand.
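
To make the ordering point concrete, here is a small sketch in generic Prolog/datalog syntax (not taken from the article): the two definitions below are logically equivalent, but under Prolog's depth-first, left-to-right strategy the second one loops forever, because the recursive call is attempted before any edge fact can be consumed, whereas a bottom-up datalog engine computes the same fixpoint for either ordering.

    % Terminates in Prolog (on acyclic edge/2 data).
    path(X, Y) :- edge(X, Y).
    path(X, Y) :- edge(X, Z), path(Z, Y).

    % Same logical meaning, but left-recursive: Prolog loops immediately.
    path2(X, Y) :- path2(X, Z), edge(Z, Y).
    path2(X, Y) :- edge(X, Y).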

In mainstream programming languages, the possibility of writing non-terminating programs also exists, but it is rarely an issue. That is why I believe a good combination of declarative and general-purpose programming has to make it really easy to recognize which parts of a program are in the declarative, terminating, safe kernel and which parts require more attention from the programmer.


The problem for Datalog adoption is that there are dozens of incompatible Datalog engines, and most (all?) are not good or well maintained. It's sad.


I think it is helpful to see datalog as a formal, conceptual kernel (a "toy programming language," as the famous Alice Book "Foundations of Databases" calls it). When we look at functional programming languages, we do not usually see them as a dozen incompatible implementations of the lambda calculus.

The kind of standardization that happened for SQL and PROLOG certainly helped spread their use, but it is a very different world today. Developers can easily do their own take on a DB or data warehouse by serving from memory, existing DBs, or files.

If you do not see it as a programming language, but as a way to think about computation, then we are in the ballpark of "rule engines": there are of course innumerable implementations of things that are called rule engines. As in the post, "rules" make knowledge explicit, but we wouldn't even expect all the folks who wrote or use these rule-engine implementations to use datalog syntax. It is more the semantics, structuring the problem as facts and rules, that counts.

Of course, having a common syntax helps and matters in getting the message out: that there is a good foundation. But how to add aggregation or user-defined functions is not settled (it is also not settled for SQL, with many vendor-specific extensions and syntaxes), and I think today's business world does not provide much incentive to agree and standardize.

Maybe academia will be able to help over time, by teaching newer generations who will then pick a standard syntax for their next PL because they are familiar with it. For academia to be interested in an applied PL topic, it has to be reasonably formal, derivable from first principles, and teachable.


Would it be a good candidate for modelling flows w.r.t. economics? For instance, country 1 has its currency pegged to country 2's, imports/exports, etc.


You mean datalog or Mangle? Mangle adds a bunch of things that make various things easy to model, in the sense that you have entities, connections, and data, and you can query that. Maybe one could call this a "knowledge hypergraph".

Datalog is very basic, and everyone needs aggregations and structured data, so Mangle also supports aggregations, structured data, and some function calls. When you use these it is clearly no longer datalog, but it is easy to see what part is datalog and what part is "more."

To qualify this a little: Mangle does not provide much for mutability (neither the "spec," which is a bit implicit, nor the implementation), so if you want to build a real DB with inserts or an RPC interface, you have to code that yourself.

I find that having a readable source file and running queries against it is good for playing around, and it may also cover many use cases for small DBs with static or slowly changing data, or configuration.


Were you able to make use of the fixpoint optimization ideas in Datafun? I thought it was a great idea but the project seems to have stalled.


Mangle uses seminaive evaluation, which is a standard, not very fancy but incremental way of computing the fixpoint. I believe Datafun uses the same, but there it requires more thought since it interacts with other language features; I need to reread the paper. There are really many low-level ways to optimize incremental fixpoint computation in practice, since the key step involves looking up whether a fact has already been computed. However, having evaluation access the FactStore through an interface means a developer can hook up various implementations, and that flexibility can matter more than the last bit of performance.
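
For the curious, here is a minimal sketch of the seminaive idea in SWI-Prolog, using lists of X-Y pairs as a stand-in for a fact store; this is purely illustrative and not how Mangle actually implements it.

    :- use_module(library(lists)).

    % seminaive(+Edges, +Delta, +Acc, -Fixpoint)
    % Each round, only the facts derived in the previous round (Delta) are
    % joined against the base relation; stop when no new facts appear.
    seminaive(_, [], Acc, Acc).
    seminaive(Edges, Delta, Acc, Fix) :-
        Delta \= [],
        findall(X-Z,
                ( member(X-Y, Delta), member(Y-Z, Edges) ),
                Derived),
        sort(Derived, Sorted),
        subtract(Sorted, Acc, New),
        append(Acc, New, Acc1),
        seminaive(Edges, New, Acc1, Fix).

    % ?- Edges = [a-b, b-c, c-d], seminaive(Edges, Edges, Edges, R).
    % R = [a-b, b-c, c-d, a-c, b-d, a-d].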


It looks like the authors of the project are not employed by Google. How did this project end up under Google's organization on GitHub?


FWIW, there's the entire field of Inductive Logic Programming, focused on learning/training (propositional or predicate logic) theories in Prolog syntax from examples presented as Prolog facts, with established packages such as Aleph and ProGolem also implemented in Prolog. See e.g. [1] for an ISO Prolog port and recent optimization/parallelization of Aleph.

[1]: https://quantumprolog.sgml.io/bioinformatics-demo/part1.html
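
For readers new to ILP, here is a schematic example of the setting in plain Prolog (not Aleph's actual input format): given background knowledge and positive/negative examples, the learner searches for a clause that covers the positives and excludes the negatives.

    % Background knowledge.
    parent(ann, bob).
    parent(bob, carol).
    parent(ann, dave).

    % Positive example: grandparent(ann, carol).
    % Negative example: grandparent(bob, ann).

    % A hypothesis an ILP system such as Aleph might induce:
    grandparent(X, Y) :- parent(X, Z), parent(Z, Y).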


I finished my master's thesis in Inductive Logic Programming recently at Oxford. I'd say that the field has continued improving since Aleph, which was written in the late 90s.

Anyone interested could also take a look at Popper (https://github.com/logic-and-learning-lab/Popper) or this overview of the first 30 years of ILP (https://arxiv.org/abs/2008.07912)


It makes me glad to see that Aleph (and ILP more generally) are still being used. It's a really powerful and interpretable way of constructing code from examples.

I made a few contributions to Aleph as part of my PhD on doing transfer learning in ILP and really enjoyed working in Prolog.

https://mark.reid.name/bits/pubs/unsw07.pdf


Reading the first page of your thesis: "Central to this approach is a novel theory of task similarity that is defined in terms of syntactic properties of rules, called descriptions, which define what it means for rules to be similar". I wonder if you could use current LLMs to obtain task similarity from semantic properties instead. (Just in case you develop the idea, put me down as a coauthor for the main idea.)

Jokes aside, what makes me able to spot that kind of application of LLMs is that I have been looking at how to combine LLMs with rule-based systems using statistical methods, so I analyze anything that smells like that.


The likes of Popper for ILP are quite incredible in their own right, although I would highlight to other readers that the article is about making reasoning explicit using Prolog, as opposed to machine learning.


A different, also very explicit way to go about this type of problem, one that also generalizes fully, is to use a Bayesian hierarchical model of a Dirichlet process and sub-isotonic regressions.

Gelman et al have written a lot about this, and they have a proposed general workflow [1]

[1] Bayesian Workflow https://arxiv.org/abs/2011.01808


I think the focus of the article is the introduction of Prolog as a general tool for data science. The examples are just incidental.


It really is a shame that none of the other 4th gen declarative languages (outside of SQL) really took off.

There is a certain clarity of purpose declarative code has that I find really pleasing.

Also I'm lazy and I prefer if the computer thinks for me.


Not knowing Prolog (or any other "4th gen declarative language" besides SQL), I don't know if it's any different, but my main peeve with anything fully declarative is that it can be deceiving. One expression computes fine, and then you change it in a seemingly insignificant way and suddenly it's three orders of magnitude slower, because underneath the surface it still translates into loops and jumps, only now you don't know which ones.


Prolog's execution model is well defined, so there shouldn't be any surprises for the proficient programmer. I'd argue it's less surprising than what an optimizing C compiler does.
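
To illustrate with a generic SWI-Prolog sketch (not from the thread): both predicates below have the same declarative meaning, but because goals run left to right, the cost difference is easy to predict once you know the execution model.

    % Quadratic: enumerate every (X, Y) pair, then test the sum.
    sum_slow(S, X, Y) :-
        between(1, 1000, X),
        between(1, 1000, Y),
        S =:= X + Y.

    % Linear: compute Y from X, then check that it is in range.
    sum_fast(S, X, Y) :-
        between(1, 1000, X),
        Y is S - X,
        between(1, 1000, Y).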


There’s absolutely nothing clear about this.


You are correct, there is nothing clear about your comment...


I am probably missing the point from a lack of understanding of linear regression, but in the first example the Prolog code doesn't make sure the gradients are the same, which means the Python code must be spitting out segments only within each linear graph section. So isn't the Python code doing the actually hard work of separating the sections, and the Prolog code doing a lot of work to find the cutoff points?

If so, would sorting the segments by start position first make it much easier? Start at the far left, go right merging the overlapping ones until there's one which doesn't overlap?

(Please don't name the graph axes X, Y and then use (X, Y) in the code for things which aren't X, Y data points; is there anything wrong with sticking to (I1, I2)? Why use "R" for span length? And what's "A"? And a nitpick: "(I1, I2)" is not a tuple in Prolog, it's a term, and "I1-I2" is a more idiomatic term with fewer parens which can be used in the same way. It's what the imported pairs library uses, for example: https://www.swi-prolog.org/pldoc/doc/_SWI_/library/pairs.pl )
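
For what it's worth, a sketch of the sort-and-sweep idea suggested above, using the I1-I2 pair convention (hypothetical predicate names, SWI-Prolog, assuming a non-empty list of segments):

    % Sort segments by start, then sweep left to right, merging any
    % segment that overlaps the one currently being accumulated.
    merge_segments(Segments, Merged) :-
        msort(Segments, [First|Sorted]),
        sweep(Sorted, First, Merged).

    sweep([], Current, [Current]).
    sweep([A-B|Rest], C-D, Merged) :-
        (   A =< D                      % overlaps the current segment
        ->  E is max(B, D),
            sweep(Rest, C-E, Merged)
        ;   Merged = [C-D|Tail],        % gap: emit current, start a new one
            sweep(Rest, A-B, Tail)
        ).

    % ?- merge_segments([2-5, 1-3, 7-9], M).
    % M = [1-5, 7-9].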


The first example, with the linear regression, is a toy problem which sets up the second example; i.e. the method works for both. You could solve the toy problem in much simpler ways in isolation, but it wouldn't help explain the second example.


Somewhat related to this: I wrote a few years ago about how I envision that SWI-Prolog could make a perfect integration point for reasoning about and integrating multiple data sources in very intelligent ways, particularly if data is published in linked data form:

https://rillabs.com/posts/linked-data-science


I always appreciate it when other data scientists talk about making their data tell a story that upper management already decided they wanted confirmed for them. Makes me feel less alone. Thank you


I wish this was more widely understood. People act like "the data tells us" actually holds some weight. And of course there are places where something is so obvious that of course it's reflected in the data. But anything subtle is probably just as subjective as giving an opinion. The more degrees of freedom, the more "the data" can fit whatever somebody wants.

In light of all that, making decisions by gut feelings or intuition is, if nothing else, at least honest, and probably just as good an approach as anything else.


> In light of all that, making decisions by gut feelings or intuition is, if nothing else, at least honest, and probably just as good an approach as anything else.

I went down this line of thought once, and I became antiscience. Reasoning-wise, I only trust what I can understand: simple high-school-level reasoning. Anything that I can't understand, anything more sophisticated than high-school-level reasoning, I don't trust. I mean, sure, there are facts that need complicated statistical methods. But I can't check them, I don't trust authority (and they are probably more often wrong than right anyway, because they need results or they starve to death), so I reject them / am ambivalent about them.


> there are facts that need complicated statistical methods

Like what? There are ideas that need causal reasoning, or trust that it exists behind things you don't fully understand. Statistics, especially anything higher order, end up being basically just rhetoric, outside maybe of some very narrow claims.

If someone tells you that you need a bunch of statistics to understand something, they're almost certainly trying to persuade you without having a strong enough logical footing to just explain themselves.

And science isn't a religion or a position, you can't really be for or against it. That should be a starting point.


For example, understanding why certain estimators give you the average treatment effect on treated individuals rather than the average treatment effect on everyone is easier if you understand the mathematics of it. But you need to know basic probability and mathematical statistics.

Causal mediation analysis is similar if you want to understand what assumptions you really need to make to talk about the mechanisms through which a treatment effect acts on the outcome.
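
For concreteness, the distinction referred to above in standard potential-outcomes notation (a textbook formulation, not tied to any particular estimator), where Y(1) and Y(0) are potential outcomes and T is treatment assignment:

    \mathrm{ATE} = \mathbb{E}[Y(1) - Y(0)]
    \qquad
    \mathrm{ATT} = \mathbb{E}[Y(1) - Y(0) \mid T = 1]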


I rest my case. If someone told me that the only way I was going to notice something is with the stuff you are saying, I'd be pretty reluctant to trust them.

If you're talking about optimizing around some agreed upon thing, sure I can understand the idea of using some more complex statistical analysis (though I might question the incremental value).


I am not saying that a convincing answer has to be convoluted. But a simple answer is not always the right one.

To understand when a simple answer is unlikely to be correct, you do benefit from understanding the mathematics. There’s a reason that people with doctorates in statistics spend 8 to 10 years in school to learn how to contribute to it.


I agree that saying that something is a gut decision is better than pretending that a gut decision was based on data. The former is easier to change.


I don't think intuition tells us much. What we have are perceived interests we seek to rationalise.

If we were all to iteratively acknowledge our interests to a greater degree, then we could probably learn a lot more from data. But I think there is a strong message in our society that self-interest is wrong, and so people dedicate great energy to pretending that their self-interest does not exist.


I made the c-suite SaaS metrics generator especially for those types of managers!

https://github.com/patricktrainer/csuite-saas-metric-generat...


Unfortunately, that is the case -- you just have to be honest and say "no, it can't do that" when it can't. The "impedance mismatch" between managers/stakeholders and contributors is an ever-renewing source of BS for many of us.


This is why I'm reluctant to go into data science, information security, etc. I feel like I will just be disappointed and disenchanted.


Change "upper management" to "politicians" and you have an accurate description of modern science.


I was a neuroscience researcher for 15 years, and the opinions or recommendations of politicians never had anything to do with what I did.


Modern science isn’t like that at all.

It’s public perception of modern science that is that way (in large part because of politicians).

Still bad, but a different kind of bad.


Are you talking about vaccines?


Words mean things. Neuroscientists study the mind.


Indentation of comments also means something. GP was not responding to the neuroscience comment.


Did the comment I replied to mention neuroscience? Please put your vote back


Related context: SWI-Prolog has a library to query and reason about a machine-readable form of PubMed through RocksDB. If you do meta-research, or AI-related biomed stuff, you should really check it out!


Not sure if you are talking about this (https://www.swi-prolog.org/pack/list?p=pubmed) or something else? R tends to have solid libraries for meta-analysis research.


I've always been curious about this sort of use of Prolog/logic programming. I wish to find more; thanks for sharing.


I have no idea what the author is trying to do or why.

Why do you want to do a linear regression on random segments?

Author talks about reasoning, but I think it would make the point clearer if there was a section that reasoned about the result on FTE.

Why are there a white upward segment and a white downward segment? Inside the blue segment there is a clear downward segment, about the size of the red downward segment. Why did it not get its own color?


>Why do you want to do a linear regression on random segments?

It may be worth reading the article fully -- why the segments are random, and the logic behind it, is explained fairly well I think.

>Author talks about reasoning, but I think it would make the point clearer if there was a section that reasoned about the result on FTE.

Same as above. Have a read of the comments around the implementation.

>Why are there a white upward segment and a white downward segment?

There are no white segments -- those are gaps.

> Inside the blue segment there is a clear downward segment, about the size of the red downward segment.

I guess you mean the last graph? I think if the isotonic regression were plotted on the graphs, it'd be clearer why that segmentation makes sense. But in brief, it is about how well an isotonic regression fits within the segments, as opposed to how it may visually seem -- the two can come apart.


There is a Prolog-based database that will take away many foundational problems in implementing a rules system or other logic programming artifact. It is called TerminusDB [1]

[1] https://github.com/terminusdb/terminusdb


This is a bit OT, but does anyone know free implementations of logic programming languages other than Prolog? Portability is a bonus.

I already know:

* Strand

* http://www.call-with-current-continuation.org/fleng/fleng.ht...

* Racket's implementation of datalog

TIA.


Probably the next best thing is Mercury: https://mercurylang.org/ I'd also say that if you get the chance, definitely learn Prolog! One nice thing about Prolog is that you can start more from an application angle and explore the language in a way that might motivate and interest you more than the drier approach of more general language learning. Classic NLP can be a lot of fun, for example. https://cs.union.edu/~striegnk/courses/nlp-with-prolog/html/ is a good free starting point, but the best starting point is one of Covington's classic texts on the subject.
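
As a taste of the classic-NLP angle, here is a tiny DCG sketch in SWI-Prolog (a generic toy grammar, not taken from the linked course):

    % A toy grammar written as a Definite Clause Grammar.
    sentence    --> noun_phrase, verb_phrase.
    noun_phrase --> [the], noun.
    verb_phrase --> verb, noun_phrase.
    noun        --> [cat].
    noun        --> [dog].
    verb        --> [chases].

    % ?- phrase(sentence, [the, cat, chases, the, dog]).
    % true.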


You can embed a microKanren in whatever language you like.


Recently, only advanced machine learning approaches have been attracting attention, so this was good practical content.


[flagged]


Turning compile errors into runtime errors since 1991.


Well, the alternative presented here is Prolog, so no, I don't think your comment applies.


[flagged]


[flagged]


Please don't respond to a bad comment by breaking the site guidelines yourself. That only makes things worse.

https://news.ycombinator.com/newsguidelines.html



