tangaroa | Entries tagged with utf-8

Seen around the web, a website displays the copyright symbol as: Ãƒâ€šÃ‚Â©

The Unicode garbage is in the source:

Ã&#402;â&#8364;&#353;Ã&#8218;Â

Googling either string shows the same characters added before special characters on other websites. I wonder what causes this.

Saving the raw HTML and running it through xxd produces:

          c326 2334 3032 3be2 2623 3833      .&#402;.&#83
3634 3b26 2333 3533 3bc3 2623 3832 3138  64;&#353;.&#8218
3bc2 a9                                  ;..

Converting the HTML escape sequences to hex code points produces:

c3 0192 e2 20ac 0161 c3 201a c2a9
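
The conversion can be checked mechanically. Here is a minimal Python sketch that takes the bytes from the xxd dump, leaves raw bytes as two-digit hex, and turns each numeric entity into its hex code point. The helper name tokens is made up for illustration, and the hex string is typed in by hand from the dump.

```python
import re

# Bytes copied by hand from the xxd dump: raw bytes mixed with ASCII "&#NNN;" entities.
raw = bytes.fromhex(
    "c3 26 23 34 30 32 3b e2 26 23 38 33 36 34 3b "
    "26 23 33 35 33 3b c3 26 23 38 32 31 38 3b c2 a9"
)

def tokens(data: bytes) -> list[str]:
    """Hypothetical helper: raw bytes as 2-digit hex, entities as 4-digit hex code points."""
    out = []
    pos = 0
    for m in re.finditer(rb"&#(\d+);", data):
        # Raw bytes between the previous entity and this one.
        out.extend(f"{b:02x}" for b in data[pos:m.start()])
        # The decimal entity value, shown as a hex code point.
        out.append(f"{int(m.group(1)):04x}")
        pos = m.end()
    out.extend(f"{b:02x}" for b in data[pos:])
    return out

print(" ".join(tokens(raw)))
# prints: c3 0192 e2 20ac 0161 c3 201a c2 a9
```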

Here c3 and e2 are single raw bytes (Ã and â in Latin-1), the entities are Unicode characters above 0xff (ƒ, €, š and ‚), and the c2 a9 at the end is the UTF-8 representation of U+00A9, the copyright symbol. Looking up the entity values shows that the garbage string is being produced exactly as specified by the HTML.
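
To double-check that the page really renders this way, here is a rough Python sketch that decodes the dumped bytes the way a browser would. It assumes the page is interpreted as Windows-1252 (the original page's declared charset is not something I recorded, so that part is a guess), then resolves the entities with the standard library's html.unescape.

```python
import html

# Same bytes as in the xxd dump, typed in by hand.
raw = bytes.fromhex(
    "c3 26 23 34 30 32 3b e2 26 23 38 33 36 34 3b "
    "26 23 33 35 33 3b c3 26 23 38 32 31 38 3b c2 a9"
)

# Assumption: the browser reads the page as Windows-1252, so each raw byte
# becomes one character (c3 -> Ã, e2 -> â, c2 -> Â, a9 -> ©).
as_text = raw.decode("cp1252")

# Resolve the numeric entities (&#402; -> ƒ, &#8364; -> €, and so on).
shown = html.unescape(as_text)

print(shown)
# prints: Ãƒâ€šÃ‚Â©
```

Under that assumption the output matches what the website displays, character for character.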

So what's going on here? People are presumably copying and pasting a symbol from some other location, most likely Word or another website, into their web browser or HTML editor. Somehow these extra characters come along for the ride, and nobody notices and fixes them. In the case of the website where I first saw this, it looks like an HTML editor translated the characters into HTML escape sequences; nobody would do that manually. I don't know what causes this or what the characters are supposed to mean.
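
One candidate explanation, offered as a guess rather than a diagnosis: the exact string falls out if the copyright symbol is saved as UTF-8, read back as Windows-1252, and saved again, three times over. The Python sketch below only shows that this process reproduces the garbage. Whether any of these websites actually went through it is an assumption, and the helper name misdecode_once is made up for illustration.

```python
def misdecode_once(text: str) -> str:
    """Hypothetical helper: write text as UTF-8, then wrongly read the bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")

s = "\u00a9"  # the copyright symbol
for round_no in range(1, 4):
    s = misdecode_once(s)
    print(round_no, s)
# prints:
# 1 Â©
# 2 Ã‚Â©
# 3 Ãƒâ€šÃ‚Â©
```

Each round turns every non-ASCII character into two or three junk characters, which would explain why the prefix grows while the © at the end survives.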

Tang's DW

What is this garbage: Ãƒâ€šÃ‚Â

April 2020
