<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:dw="https://www.dreamwidth.org">
  <id>tag:dreamwidth.org,2009-04-21:107589</id>
  <title>Tang's DW</title>
  <subtitle>tangaroa</subtitle>
  <author>
    <name>tangaroa</name>
  </author>
  <link rel="alternate" type="text/html" href="https://tangaroa.dreamwidth.org/"/>
  <link rel="self" type="text/xml" href="https://tangaroa.dreamwidth.org/data/atom"/>
  <updated>2013-07-12T16:54:36Z</updated>
  <dw:journal username="tangaroa" type="personal"/>
  <entry>
    <id>tag:dreamwidth.org,2009-04-21:107589:44038</id>
    <link rel="alternate" type="text/html" href="https://tangaroa.dreamwidth.org/44038.html"/>
    <link rel="self" type="text/xml" href="https://tangaroa.dreamwidth.org/data/atom/?itemid=44038"/>
    <title>What is this garbage: Ãƒâ€šÃ‚Â</title>
    <published>2013-07-12T16:52:59Z</published>
    <updated>2013-07-12T16:54:36Z</updated>
    <category term="utf-8"/>
    <category term="computers"/>
    <category term="html"/>
    <category term="webdev"/>
    <category term="unicode"/>
    <dw:security>public</dw:security>
    <dw:reply-count>0</dw:reply-count>
    <content type="html">Seen around the web: a website displays the copyright symbol as Ãƒâ€šÃ‚Â©&lt;br /&gt;&lt;br /&gt;The Unicode garbage is in the source:&lt;br /&gt;&lt;pre&gt;Ã&amp;amp;#402;â&amp;amp;#8364;&amp;amp;#353;Ã&amp;amp;#8218;Â&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Googling either string shows the same characters inserted before special characters on other websites. I wonder what causes this.&lt;br /&gt;&lt;br /&gt;Saving the raw HTML and running it through xxd produces: &lt;br /&gt;&lt;br /&gt;&lt;pre&gt;
          c326 2334 3032 3be2 2623 3833      .&amp;amp;#402;.&amp;amp;#83
3634 3b26 2333 3533 3bc3 2623 3832 3138  64;&amp;amp;#353;.&amp;amp;#8218
3bc2 a9                                  ;..&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Converting the HTML escape sequences to their code points produces: &lt;br /&gt;&lt;br /&gt;&lt;pre&gt;c3 0192 e2 20ac 0161 c3 201a c2a9&lt;/pre&gt; &lt;br /&gt;&lt;br /&gt;Here c3 and e2 are one-byte characters, the entities are multibyte Unicode characters, and c2a9 at the end is the &lt;a href="https://en.wikipedia.org/wiki/UTF-8#Description"&gt;UTF-8 representation&lt;/a&gt; of Unicode U+00A9, the copyright symbol. Looking up the entity values shows that the browser is rendering the garbage string exactly as the HTML specifies. &lt;br /&gt;&lt;br /&gt;So what's going on here? People are presumably copying and pasting a symbol from some other location, most likely Word or another website, into their web browser or HTML editor. Somehow these extra characters get passed along without anyone noticing and fixing them. On the website where I first saw this, it looks like an HTML editor translated the characters into HTML escape sequences; nobody would do that manually. I don't know what causes this or what the characters are supposed to mean.</content>
  </entry>
</feed>
