

Does anybody know how to search for HTML-ENTITIES? - ericol

Hello people.
Sorry if this is too dumb a question. I'm trying to find information about an encoding in PHP that can be used with the mbstring-related functions. The thing is, it seems ALL search engines act as if the hyphen doesn't exist.

I tried Google, DuckDuckGo and Bing, with similar results in all of them.

The PHP site's search is powered by Google. Too bad. If you search for HTML-ENTITIES, found here:

http://ar2.php.net/manual/en/mbstring.supported-encodings.php

that page is not listed in the first page of results. In the second page you start to get pages in other languages (like, you know, Japanese, or German). Shame on you, php.net.

So, does anybody know if there's a programmer-friendly setting or engine around?

Thanks in advance.

P.S.: Again, sorry if this is a stupid question. But I think it's even more stupid spending more than 2 minutes trying to find out *how to do a search*.
======
lutusp
Are you asking for an explanation of HTML entities, or a listing of HTML
entities that browsers recognize, or some third thing I cannot imagine?

Also, now that (new) Web pages are nearly all UTF-8 encoded, HTML entities are
no longer used, sometimes even deprecated in favor of Unicode plain-text. The
advantages are smaller size, easier editing, and a solution to the question
about which browsers support which entities.

> But I think it's even more stupid spending more than 2 minutes trying to
> find out how to do a search.

So you think it should be someone else's two minutes? But seriously, ask
yourself if you should even be considering using entities in 2014. Doesn't
your HTML development environment support UTF-8?

~~~
ericol
Thanks for your reply.

HTML-ENTITIES is one of the accepted encodings for the mbstring-related
functions in PHP; some of those functions take one or two encoding parameters
whenever there is some sort of conversion (from one encoding to another).
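
To make the idea concrete, here is a rough sketch in Python rather than PHP. It is not the mbstring function itself: Python has no encoding named HTML-ENTITIES, so the `xmlcharrefreplace` error handler stands in for it, and the helper name `to_latin1_with_refs` and the sample word are assumptions made for illustration. Characters that ISO-8859-1 cannot represent come out as numeric character references, while the rest are encoded normally.

```python
def to_latin1_with_refs(text: str) -> bytes:
    """Encode text as ISO-8859-1 bytes.

    Characters outside ISO-8859-1 are replaced with decimal numeric
    character references (&#N;) instead of raising an error.
    """
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


# Hypothetical sample word; three of its letters are not in ISO-8859-1.
print(to_latin1_with_refs("Książę"))  # b'Ksi&#261;&#380;&#281;'

# A character that ISO-8859-1 does contain is kept as a single byte.
print(to_latin1_with_refs("é"))  # b'\xe9'
```

The output is plain ASCII plus ISO-8859-1 bytes, so it can be sent on a page declared as ISO-8859-1 as long as nothing escapes the ampersands a second time.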

I am trying to find some examples online (or in the php.net site) and
explanation of its usage.

> Also, now that (new) Web pages are nearly all UTF-8 encoded, HTML entities
> are no longer used

I'm totally aware of UTF-8 and encodings; I'm responsible for fixing an
incredible amount of encoding related bugs in the site.

But I'm working on a fairly large site, that sends the info to the browser in
ISO-8859-1 (Can you feel my pain now??? :P ) and there's no way I can change
that (at least not in one go). Legacy code and all that stuff. I cannot go to
my boss and tell him "We need to change the whole encoding of the site" when
all that he wants is this bug fixed, pronto.

I found a bug where some chars (Hungarian, in this case) are not properly
shown in the page: ę >> &#amp;#281;

To add insult to injury, the strings are being truncated in some cases, so you
could end up with Ksi&#amp;#2...
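
One way to avoid cutting a reference in half is to treat each complete reference as a single character while truncating. The following is a minimal Python sketch of that idea, not the site's actual code; the function name `truncate_markup` and the sample string are hypothetical.

```python
import re

# A complete decimal, hexadecimal, or named character reference,
# for example &#281; or &#x119; or &eacute;
_REF = re.compile(r"&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def truncate_markup(text: str, limit: int) -> str:
    """Keep at most `limit` characters, counting each reference as one."""
    out = []
    count = 0
    pos = 0
    while pos < len(text) and count < limit:
        match = _REF.match(text, pos)
        if match:
            # Copy the whole reference so it is never split.
            out.append(match.group(0))
            pos = match.end()
        else:
            out.append(text[pos])
            pos += 1
        count += 1
    return "".join(out)


print(truncate_markup("Ksi&#261;&#380;ka", 4))  # Ksi&#261;
```

An alternative design is to decode the references first, truncate the decoded string, and encode again afterwards, which keeps the truncation code unaware of markup.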

Finally, I don't think it should be someone else's 2 minutes: I was just
asking if somebody knows the answer to that.

I hope this clarifies the matter, feel free to ask anything else that you
think would help (I reckon I didn't put much information up front).

Also, I apologize if some of my sentences are difficult to understand; I'm not
a native English speaker.

~~~
lutusp
> I am trying to find some examples online (or in the php.net site) and
> explanation of its usage.

There are any number of examples of entities, and lists of them. They are a
terrible hassle because not all browsers understand the same ones or treat
them the same.

> I'm responsible for fixing an incredible amount of encoding related bugs in
> the site.

That should be fun. :) There will be times when you won't be able to decide
whether you're fixing an error or introducing one.

> I found a bug where some chars (Hungarian, in this case) are not properly
> shown in the page: ę >> &#amp;#281;

The reason should be obvious -- the original code needed to be preserved
unchanged, but a post-processor escaped the ampersand -- and incorrectly as
well. I wish there were some fast and easy rules, preferably scriptable, but
the examples you show are too varied, as though there was more than one cook
in the kitchen (an English idiom).
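
For the one shape shown in the thread, a narrow repair can still be scripted. This Python sketch assumes the damage always looks like `&#amp;#281;` or the more usual `&amp;#281;` with a decimal number; the name `repair_refs` is hypothetical, and anything outside those two patterns is left alone.

```python
import re

# "&amp;" or "&#amp;" followed by what was meant to be the tail of a
# decimal numeric reference, for example "#281;"
_MANGLED = re.compile(r"&#?amp;(#[0-9]+;)")


def repair_refs(text: str) -> str:
    """Turn &amp;#N; and &#amp;#N; back into &#N;."""
    return _MANGLED.sub(r"&\1", text)


print(repair_refs("&#amp;#281;"))     # &#281;
print(repair_refs("Tom &amp; Jerry"))  # Tom &amp; Jerry
```

A truncated fragment such as `Ksi&#amp;#2...` does not match, because the number is never closed with a semicolon; that information is already lost and cannot be recovered by a pattern.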

I still think you should simply take out entities wherever you can and use
ordinary Unicode characters. That also solves the problem of figuring out what
prior editors had in mind -- assuming the resulting spelling is unambiguous.
But you can also write regular expressions to solve most of the syntactically
correct cases, including:

&(string);

-- and --

&#(number);

The first is obviously more difficult because you have to create an
associative array (what Python calls a dictionary) to do the translations. The
second case is easier, and I have seen examples where the enclosed number was a
normal Unicode code point, or a sequence of two.
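
A minimal Python sketch of both cases, under a few assumptions: only decimal numbers are handled, the dictionary is the standard library's `html.entities.name2codepoint` (HTML 4 names only), and the function name `decode_refs` is made up for this example.

```python
import re
from html.entities import name2codepoint

# &(string); or &#(number); with a decimal number
_REF = re.compile(r"&(#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def decode_refs(text: str) -> str:
    """Replace known named and decimal numeric references with characters."""

    def replace(match):
        body = match.group(1)
        if body.startswith("#"):
            try:
                return chr(int(body[1:]))
            except (ValueError, OverflowError):
                # Not a valid code point: leave the text untouched.
                return match.group(0)
        code = name2codepoint.get(body)
        # Unknown names are left untouched as well.
        return chr(code) if code is not None else match.group(0)

    return _REF.sub(replace, text)


print(decode_refs("&#281; and &eacute;"))  # ę and é
print(decode_refs("&bogus;"))              # &bogus;
```

In practice Python's `html.unescape` already does this and covers hexadecimal references and the HTML5 names too; the hand-written version only shows how the dictionary lookup and the regular expression fit together.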

Here is a big list of entities:

[http://dev.w3.org/html5/html-author/charref](http://dev.w3.org/html5/html-author/charref)

If you hover over each entry, the equivalent Unicode is given, so it seems
multiple forms are embedded in the page. You could scrape the page and create
a master list / translation table.
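
If Python is available, scraping may not be needed: the standard library ships the HTML5 names in `html.entities.html5`. The sketch below builds a name-to-text table from it; `build_entity_table` and `ENTITY_TABLE` are hypothetical names, and whether this table matches the linked page entry for entry is an assumption I have not checked.

```python
from html.entities import html5


def build_entity_table() -> dict:
    """Map entity names (without the trailing ';') to their replacement text."""
    table = {}
    for name, value in html5.items():
        # Keys look like "eacute;"; a few legacy names also appear without ';'.
        table[name.rstrip(";")] = value
    return table


ENTITY_TABLE = build_entity_table()
print(ENTITY_TABLE["eogon"])  # ę
```

Some values are two code points long, which is why the table stores strings rather than single numbers.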

The final problem is that you will need to establish which encoding each page
has, and don't mix encodings. From your comments, some pages are UTF-8 and
some ISO-8859-1, and those two are obviously incompatible.
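
A common heuristic for sorting pages into the two camps is to try a strict UTF-8 decode and fall back to ISO-8859-1 when it fails. This Python sketch assumes those are the only two encodings in play; `decode_page` is a hypothetical helper, and the check is a guess rather than proof, since pure ASCII and some ISO-8859-1 byte sequences are also valid UTF-8.

```python
from typing import Tuple


def decode_page(raw: bytes) -> Tuple[str, str]:
    """Return (text, encoding name), preferring UTF-8 when the bytes allow it."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this cannot fail.
        return raw.decode("iso-8859-1"), "iso-8859-1"


print(decode_page("ę".encode("utf-8")))  # ('ę', 'utf-8')
print(decode_page(b"caf\xe9"))           # ('café', 'iso-8859-1')
```

Once every page is decoded to text, the output side can pick a single encoding, so the two never end up mixed in one response.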

> Also, I apologize if some of my sentences are difficult to understand; I'm
> not a native English speaker.

As usual in cases like this (in my experience), your prose is better than that
of many native English speakers.

Sok szerencsét!

~~~
ericol
Thanks :)

