
Show HN: Parse no-protocol URL, no-domain URI and email addresses etc from text - andrew-kang-g
https://github.com/Andrew-Kang-G/pattern-dreamer
======
darrenf
To the OP: I would suggest normalising the protocol. Your examples show that
you retain the case from the raw data:

    
    
        "protocol": "HTTP"
        "protocol": "https"
    

As it stands, users will end up normalising it themselves, and you can save
them the effort. FWIW this part is typically called the scheme, not the
protocol (e.g. Perl has `URI::Find::Schemeless`, which performs a similar task):

    
    
        URI = scheme:[//authority]path[?query][#fragment]
    

See
[https://tools.ietf.org/html/rfc3986#section-3](https://tools.ietf.org/html/rfc3986#section-3)

To that end, you might also want to consider adding support for fragments. :)
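
For what it's worth, here is a minimal sketch of that normalisation. It uses Python's standard `urllib.parse.urlsplit` purely for illustration (the project itself is JavaScript, and the `describe` helper is a made-up name). `urlsplit` calls the part `scheme`, lowercases it, and exposes the fragment as its own field:

```python
from urllib.parse import urlsplit

def describe(url):
    """Split a URL into RFC 3986 components (hypothetical helper)."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,      # urlsplit lowercases the scheme
        "host": parts.hostname,      # the hostname property is lowercased too
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }

upper = describe("HTTP://Example.com/a?x=1#top")
lower = describe("https://example.com/a?x=1#top")
```

Both inputs come back with a lowercase scheme, so callers never have to compare "HTTP" against "http" themselves.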

~~~
felixfbecker
Well, in JS it's called protocol:
[https://developer.mozilla.org/en-US/docs/Web/API/URL/protoco...](https://developer.mozilla.org/en-US/docs/Web/API/URL/protocol)

~~~
JoBrad
Node calls it a protocol scheme.
[https://nodejs.org/dist/latest-v12.x/docs/api/url.html#url_u...](https://nodejs.org/dist/latest-v12.x/docs/api/url.html#url_urlobject_protocol)

~~~
noobiemcfoob
Congratulations! You're all right!

------
appleflaxen
I can't figure out what this tool does, even after spending several minutes on
the linked github page, and following the npmjs link on that page.

But clearly HN likes it (it's number two...); can anyone explain it to me?

~~~
darkwater
Looks like it's able to detect URLs in "natural language" text, for example
"this is alink[https://wikipedia.org](https://wikipedia.org)". Check out the
live demo at
[https://jsfiddle.net/AndrewKang/xtfjn8g3/](https://jsfiddle.net/AndrewKang/xtfjn8g3/)

~~~
np_tedious
I think the "no-protocol" in the post title suggests the http(s) part is
unnecessary. Makes sense: it's a pretty easy problem when the scheme is there.
------
prashnts
I’m just wondering if it can parse things like « pls-no-spam at google’s email
service », which a human can easily parse to pls-no-spam@gmail.com. I suppose
POS tagging with some heuristics can get it working for a few cases.

And if it can, I feel most of its users would be email-harvesting spam bots,
which would not be super nice. Although in ideal, less common scenarios it
could actually be useful for IR purposes.

~~~
albertgoeswoof
Not sure why people bother to use the << spam at google's email >> approach
when they can use an email forwarding service that lets you turn email
addresses off, like [https://idbloc.co](https://idbloc.co) (disclaimer, I
built this)

~~~
omegabravo
I think these are a great idea. The reason that people use that method is
because it is free, and the suggested alternative costs money.

I typically use throw away email addresses instead since they're free. If I
actually want to use the service/account some day I'll sign up properly.

I expect someday the primary email providers will generate aliases on demand.

~~~
myu701
Fastmail lets you convert myusername@fastmail.com to
whateverwebsiteorservice@myusername.fastmail.com

It's not a literal alias, but it works out fine.

One of these days I'm going to propose the Thunderbird + K-9 mail addons that
say 'set the sent from email address to be equal to the sent to email
address', that way if service@myusername.fastmail.com is the recipient, it
will be the sender to any replies, rather than the MUA default replying with
myusername@fastmail.com (but that's a separate discussion)
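
A rough sketch of that "reply from the address it was sent to" rule, using Python's standard `email` package. The function name `pick_reply_sender` and the sample addresses are assumptions for illustration only; this is not how Thunderbird or K-9 actually work:

```python
from email.message import EmailMessage
from email.utils import getaddresses

def pick_reply_sender(received, my_domain, default_sender):
    """Reply from whichever of our addresses the message was sent to."""
    recipients = getaddresses(
        received.get_all("To", []) + received.get_all("Cc", [])
    )
    for _name, address in recipients:
        if address.lower().endswith("@" + my_domain):
            return address
    # No recipient on our per-service subdomain: fall back to the MUA default.
    return default_sender

incoming = EmailMessage()
incoming["From"] = "noreply@service.example"
incoming["To"] = "service@myusername.fastmail.com"

sender = pick_reply_sender(
    incoming, "myusername.fastmail.com", "myusername@fastmail.com"
)
```

Here `sender` ends up as the per-service address rather than the account's main one.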

~~~
asdkhadsj
I didn't know that was possible, really neat!

------
sildur
It fails with IPv6 URLs like http://[::1]:3001/
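
For comparison, a bracketed IPv6 literal is a valid authority under RFC 3986, and a standards-following parser accepts it. A quick illustration with Python's `urllib.parse.urlsplit`, used here only as a reference point and not as the project's own code:

```python
from urllib.parse import urlsplit

parts = urlsplit("http://[::1]:3001/")
host = parts.hostname   # brackets are stripped from the IPv6 literal
port = parts.port       # parsed as an integer
```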

~~~
jakeogh
[https://github.com/jakeogh/hostname-validate/blob/master/hos...](https://github.com/jakeogh/hostname-validate/blob/master/hostname_validate_regex_mono.py)

eventually it will get integrated with:
[https://github.com/jakeogh/kcl/blob/master/kcl/htmlops.py](https://github.com/jakeogh/kcl/blob/master/kcl/htmlops.py)

------
playpause
Why "pattern-dreamer"?

------
jxub
It looks interesting, but neither the readme nor the title makes clear what
the project is about. Would someone care to explain?

~~~
mgliwka
This project attempts to parse URLs and email addresses from text, even when
the protocol is missing or they're mangled in some other fashion (e.g.
google.com/test-url?id=2 gets parsed and identified as a valid URL, but
invalid.nontld/test does not, and "test@example.euSample Text" gets parsed as
test@example.eu).
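
To make those cases concrete, here is a minimal sketch of one way such extraction could work. It is not the project's actual algorithm: the tiny `KNOWN_TLDS` set, both regexes, and the function names are all assumptions for illustration, and a real tool would need the full IANA TLD list and far more edge-case handling.

```python
import re

# Tiny illustrative allowlist; a real extractor would load the full TLD list.
KNOWN_TLDS = {"com", "org", "net", "eu", "io"}

# Dotted host name, optionally followed by a path/query, and not glued to a
# preceding word character, "@", "." or "-".
URL_CANDIDATE = re.compile(
    r"(?<![\w@.-])([a-z0-9-]+(?:\.[a-z0-9-]+)+)(/\S*)?", re.IGNORECASE
)
EMAIL_CANDIDATE = re.compile(
    r"([a-z0-9._%+-]+)@([a-z0-9-]+(?:\.[a-z0-9-]+)*)\.([a-z]+)", re.IGNORECASE
)

def extract_urls(text):
    """Return scheme-less URL candidates whose last host label is a known TLD."""
    urls = []
    for match in URL_CANDIDATE.finditer(text):
        host, rest = match.group(1), match.group(2) or ""
        if host.rsplit(".", 1)[1].lower() in KNOWN_TLDS:
            urls.append(host + rest)
    return urls

def extract_emails(text):
    """Return addresses, trimming text glued onto the TLD (e.g. 'euSample')."""
    emails = []
    for local, domain, tail in EMAIL_CANDIDATE.findall(text):
        # Try the longest prefix of the final label first.
        for end in range(len(tail), 0, -1):
            if tail[:end].lower() in KNOWN_TLDS:
                emails.append(f"{local}@{domain}.{tail[:end]}")
                break
    return emails

urls = extract_urls("see google.com/test-url?id=2 or invalid.nontld/test")
emails = extract_emails("test@example.euSample Text")
```

The TLD check is what separates google.com from invalid.nontld, and the prefix trim is what recovers test@example.eu from the glued-on "Sample".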

~~~
YesThatTom2
Suggestion: change parse to extract

That would be even more clear.

~~~
ken
FWIW, Apple’s implementation of this is called “NSDataDetector”.

------
andrew-kang-g
Thanks for all the comments. As of v1.6, IPv6 and intranet URLs can now be
extracted.

