Hacker News
Duckling: a Clojure library that parses text into structured data (wit.ai)
121 points by getdreambits 51 days ago | 24 comments

The thing that I like about Duckling is that it is a rules-based system, which can easily be interrogated. Model-based text extraction is much harder to fix when there is a bug. I use Duckling as a service for value extraction from queries and content, alongside a model-based system for NER (such as spaCy). Using both together makes for more accurate enrichment in general (by cross-referencing values between the two and adding exception rules).

It's exactly how we used it internally for wit.ai

(Very small correction: Duckling is rule-based but uses a super simple Naive Bayes classifier to prioritize between the many potential parses produced by the rules -- we see it as a hybrid approach)
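To make the hybrid idea concrete, here is a minimal Python sketch, assuming two toy date rules that both match the same text. It is not Duckling's code: the rule names, the `candidates`/`train`/`best_parse` helpers, and the training examples are all hypothetical, and the scorer is a smoothed per-rule frequency standing in for a real Naive Bayes classifier.

```python
import math
import re
from collections import Counter

# Hypothetical toy rules (not Duckling's): both match "N/N" but read it differently.
RULES = {
    "month/day": (re.compile(r"^(\d{1,2})/(\d{1,2})$"),
                  lambda m: {"month": int(m.group(1)), "day": int(m.group(2))}),
    "day/month": (re.compile(r"^(\d{1,2})/(\d{1,2})$"),
                  lambda m: {"day": int(m.group(1)), "month": int(m.group(2))}),
}

def candidates(text):
    """Run every rule and keep the parses with plausible calendar values."""
    out = []
    for name, (pattern, build) in RULES.items():
        m = pattern.match(text)
        if m:
            parse = build(m)
            if 1 <= parse["month"] <= 12 and 1 <= parse["day"] <= 31:
                out.append((name, parse))
    return out

def train(examples):
    """Count how often each rule produced the labelled (correct) parse."""
    ok, bad = Counter(), Counter()
    for text, gold in examples:
        for name, parse in candidates(text):
            (ok if parse == gold else bad)[name] += 1
    return ok, bad

def score(name, ok, bad):
    """Laplace-smoothed log-probability that a parse from this rule is correct."""
    return math.log((ok[name] + 1) / (ok[name] + bad[name] + 2))

def best_parse(text, ok, bad):
    """Rules propose parses; the learned scores decide which one wins."""
    ranked = sorted(candidates(text), key=lambda c: score(c[0], ok, bad), reverse=True)
    return ranked[0][1] if ranked else None

# Made-up training data in which month/day is usually the right reading.
EXAMPLES = [
    ("3/4", {"month": 3, "day": 4}),
    ("12/1", {"month": 12, "day": 1}),
    ("25/12", {"day": 25, "month": 12}),
]
OK, BAD = train(EXAMPLES)
```

With this data, `best_parse("5/6", OK, BAD)` prefers the month/day reading, while `"25/12"` can only be parsed as day/month. Fixing a bad parse means editing a rule or adding a labelled example, which is the interrogability the parent comment describes.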

Hello HN, I'm the original author of Duckling (with @blandinw). As usual, always happy to get feedback and suggestions.

Interesting! When I worked at IBM, we evaluated Duckling (the Haskell version) for use in the Watson Assistant product but decided to write our own numerical quantity parser/interpreter. We used ANTLR and created context-free grammars as we found that we could improve both precision and recall substantially. Sadly not open source though.

I must say it looks very neat from a usability point of view. Are the training data sets open? Do you think it's feasible for small app coders (who don't have thousands of examples to train on) to use Duckling as, more or less, an NLP parser without getting too deep into NLP and AI theory?

Are the trained sets meant to be used by different client code or languages?

Yes, all the training data is in the repo.

Duckling is suited to parsing very structured language, typically temporal expressions (dates and times...). It relies on a mix of rules and machine learning. Rules and datasets for many (human) languages are available in the repo. You don't need a lot of data to add support for what you need, owing to this hybrid rules+ML approach (as opposed to just ML).

Hey, author of the Haskell re-write (https://github.com/facebook/duckling). We've implemented custom dimensions for extensibility (example: https://github.com/facebook/duckling/blob/master/exe/CustomD...).

Hi, thanks for dropping in. What's the status of the Clojure implementation? Would you recommend it for new projects? Is anyone looking at new/old issues? Are there potential new maintainers for the Clojure version?

The current Clojure version is quite stable; we used it at wit.ai/Facebook for several years before moving to Haskell.

I'd love to see somebody take over and resuscitate it! One interesting direction could be to remove the Java dependencies (mostly on Date) so that it's usable in ClojureScript. It would make a great JS library.

Out of curiosity, why the move from Clojure to Haskell?

I touched on that on Reddit: https://www.reddit.com/r/Clojure/comments/68r4lz/one_of_face...

TL;DR Haskell made more sense for us to scale with the number of requests (existing FB infra) as well as the number of engineers working on the project (type checking, etc).

> Everything is hashmap-typed and you need to do runtime checks to make sure the data you receive is what you expect. You cannot trust any input

I don't mean to start a type-system flamewar, but in Java, if you receive a Person object and want to get firstName, your options are still the same:

1. runtime check to see person.firstName is not null.

2. or blindly assume that it can never be null and do person.firstName.trim()

So those are exactly the options in Clojure too. What am I missing here?

Also check that person isn't null, probably.


What (s)he means, though, is that in Java, if firstName is there, then it will be a string.

In Clojure firstName might be anything, a string, a number or even an entire other hashmap or type, literally anything. This might or might not cause a runtime crash if you are doing something that assumes it's a string. So you check.
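A minimal sketch of that runtime check, written in Python as a stand-in for Clojure since both are dynamically typed. The `first_name` key and the `normalized_first_name` helper are hypothetical, chosen only to mirror the `firstName` example above.

```python
def normalized_first_name(person):
    """Return the trimmed first name, or None if the record is not usable."""
    # The record itself may be missing or may not be a map at all.
    if not isinstance(person, dict):
        return None
    value = person.get("first_name")
    # The key may be absent, or hold a number, a nested map, or anything else.
    if not isinstance(value, str):
        return None
    return value.strip()
```

Calling `.strip()` directly on `person["first_name"]` would raise an exception for `{"first_name": 42}`; the `isinstance` checks are what a static type on the field would otherwise guarantee. A null check is still needed either way.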

ok so it saves an extra (typeof x) check.

To be honest, I rarely see the typeof check in Clojure code.

And "saves" is a bit of a misnomer, since it implies the "cost" (of all the static type machinery) is less. Well, dynamic fans (or those with dynamic preferences, if "fan" is too strong) will disagree. ;) In practice, many systems get streams of bits from somewhere (like the network) that commonly get interpreted into strings and from there into other types. The validation and conversion are necessary in any language; after that, though, it's just FUD to bring up that a function expecting a person with a :name key and string value could potentially be given something else. In the cases where we change it into something else, static types are a nice extra assurance of consistency, but they aren't the only way or the most impactful way to gain assurances.

Also this quote was about the cited reason for FB preferring Haskell to Clojure generally, not about this project specifically. IME it's common in big organisations to prefer stuff that has controls at the cost of productivity. It's probably harder to ship buggy Haskell code even if it's harder to write.

Scalability. There's more context on the move in the 2017 post: https://medium.com/wit-ai/open-sourcing-our-new-duckling-47f...

It knows about "Labor day" but not "Labour day".

Which language ruleset are you using? I imagine the latter is not in EN_US, but would be in EN_GB
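As a rough illustration of why the ruleset matters, here is a Python sketch of locale-specific rules layered over a shared base for the language. The tables and the `lookup_holiday` helper are entirely hypothetical and say nothing about what Duckling's EN_US or EN_GB rules actually contain.

```python
# Hypothetical holiday-name tables keyed by language and locale; not Duckling's data.
BASE_RULES = {"EN": {"christmas": "christmas"}}
LOCALE_RULES = {
    "EN_US": {"labor day": "labor-day"},
    "EN_GB": {"labour day": "labour-day"},
}

def lookup_holiday(text, locale):
    """Resolve a holiday name using base-language rules plus locale overrides."""
    lang = locale.split("_")[0]
    # Locale-specific entries are merged on top of the shared language entries.
    rules = {**BASE_RULES.get(lang, {}), **LOCALE_RULES.get(locale, {})}
    return rules.get(text.strip().lower())
```

Under these assumed tables, "Labour day" resolves with `EN_GB` but returns `None` with `EN_US`, which is the kind of behavior the question above is probing for.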

It’s a Haskell library now. https://github.com/facebook/duckling

I came here to mention the same thing. I experimented with the Clojure version a long while ago, and evaluated the Haskell version about a year ago for a project at work. Good stuff.

Glad to see there is life beyond Python for corporate AI usage.

You're a bit late to the party…
