non-ASCII characters in UTF-8

Variables from phpMyAdmin:

show variables like 'character%';

character_set_client utf8
character_set_connection utf8
character_set_database utf8
character_set_filesystem binary
character_set_results utf8
character_set_server latin1
character_set_system utf8
character_sets_dir /usr/share/mysql/charsets/

SHOW VARIABLES LIKE 'collation%'

collation_connection utf8_general_ci
collation_database utf8_general_ci
collation_server latin1_general_ci

On page:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

When importing with phpMyAdmin, Polish non-ASCII characters get imported correctly.

When displaying non-ASCII chars in concrete5, they are replaced with question marks (�). Why?

When writing to the database through concrete5 (editing a document etc.), non-ASCII chars are replaced in the database by "Ä???&oacute;Ä" etc. Why?

When the tables in the database are set to latin2_general_ci, non-ASCII chars display correctly on the site. Why?
Also, why are non-ASCII chars replaced with numeric character references in the database (&#243; etc.)?

I can't get UTF-8 encoded non-ASCII characters to display. All I get are question marks.
Any ideas?

Concrete5 version 5.1.0

 
katz515 replied:
OK here is a quick solution that I found.

My c5 is 5.2.0.

First you have to edit the following file

/concrete/libraries/3rdparty/adodb/drivers/adodb-mysql.inc.php

On line 371, add the following code:

mysql_query( "SET NAMES utf8");

Add the line above just before the following code:
if ($this->_connectionID === false) return false;


But I don't know if this is a good idea.

Also, once you do this, you MUST set your database collation to a UTF-8 one (utf8_general_ci, for example).
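
For anyone wondering why the connection charset matters, here is a minimal sketch (not c5 code). misread_as_latin1() shows what happens when UTF-8 bytes are treated as latin1, which is roughly what a mismatched connection does. connect_utf8() is a hypothetical helper that assumes the mysqli extension and sets the charset on the connection, the mysqli counterpart of SET NAMES utf8. Both function names are made up for illustration.

```php
<?php
// Sketch only: reinterprets the raw bytes of a UTF-8 string as latin1
// (ISO-8859-1) and re-encodes them as UTF-8, which is the kind of
// garbling a charset mismatch produces.
function misread_as_latin1(string $utf8): string {
    return mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
}

// "ó" is the two bytes C3 B3 in UTF-8; read as latin1 they turn into
// the two characters "Ã" and "³".
echo misread_as_latin1("\xC3\xB3"), "\n";

// Hypothetical helper (not part of c5): opens a mysqli connection and
// sets its charset instead of patching the ADOdb driver.
function connect_utf8(string $host, string $user, string $pass, string $db): mysqli {
    $link = new mysqli($host, $user, $pass, $db);
    $link->set_charset('utf8');
    return $link;
}
```

The first function needs the mbstring extension. The second is only defined here, not called, because it needs a real database to connect to.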
pdm replied:
This works, and it is a good idea ;) Only emails don't display non-ASCII chars, so I changed the sendMail() function in /concrete/helpers/mail.php (line 118) to:

public function sendMail() {
      $from = $this->generateEmailStrings($this->from);
      $to = $this->generateEmailStrings($this->to);
      if (ENABLE_EMAILS) {
         $naglowki  = "MIME-Version: 1.0\r\n";
         $naglowki = "Content-type: text; charset=utf-8\r\n";
         $naglowki .= "From: Formularz_www<".$from.">";      
         mail($to, $this->subject, $this->body, $naglowki);
      }
}


and everything works.
katz515 replied:
Cool. I'll give it a try.

Now there are two problems left...

1. Questions for Forms.

I think c5 uses AJAX to send the INSERT query and so on... Since it only handles ASCII, the multi-byte characters become corrupted.

2. Search

Search cannot search UTF-8 characters yet.


But solving email was really good.

Thanks a lot! You made my day.
Remo replied:
I fixed the Lucene problem; there's a discussion somewhere where I attached a very simple patch.

Andrew already merged it, so I guess it will be part of the next c5 update.

C5 does use a few AJAX calls, but not exclusively, and even where it does, that's not the problem. Even with AJAX it's possible to insert UTF-8 characters; this works fine.
I only have a few characters that can't be saved (öäü), so I don't see the problem as often as you do. Have you made a list of the blocks and functions that don't handle UTF-8 properly?
katz515 replied:
Oh, ok.

I kinda remember your lucene post.
I'll find it.

And about AJAX (specifically the form block), it doesn't work on our side. It works if I insert the Japanese directly into MySQL... And I could insert Japanese letters during the installation. So I'm guessing it's AJAX that is messing up the multi-byte characters.

I posted the list of the problems at
http://www.concrete5.org/community/forums/internationalization/mult...

I didn't bother to submit this as a bug since it is not a big deal for most people... only for East Asian languages, which use multi-byte characters. (Or should I?)

Anyway, when I'm done with everything, I'll submit the code to Andrew.
Remo replied:
Sure, there's a problem, but it's not related to AJAX; it comes from the c5 code. That's the only thing I wanted to say.

It actually is a problem, since German also uses a few multi-byte letters. Not as many as you need though ;-)

The same goes for most languages in Europe; they mostly have a few "strange letters" too.

I'm probably going to work on that issue later this week.
Remo replied:
You probably wanted to write .= and not just =?

public function sendMail() {
        $from = $this->generateEmailStrings($this->from);
        $to = $this->generateEmailStrings($this->to);
        if (ENABLE_EMAILS) {
            $naglowki  = "MIME-Version: 1.0\r\n";
            $naglowki .= "Content-type: text; charset=utf-8\r\n";
            $naglowki .= "From: Formularz_www<".$from.">";        
            mail($to, $this->subject, $this->body, $naglowki);
        }
}


Otherwise MIME-Version doesn't find its way into the header.
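
One way to make this kind of slip harder is to collect the header lines in an array and join them once. This is only a sketch: build_mail_headers() is a hypothetical helper, not the c5 mail helper, the address is a placeholder, and text/plain is my assumption for the content type.

```php
<?php
// Sketch: collect the header lines in an array and join them once,
// so no line can be lost to a stray "=" in place of ".=".
function build_mail_headers(string $from): string {
    $headers = array(
        'MIME-Version: 1.0',
        'Content-Type: text/plain; charset=utf-8',
        'From: Formularz_www <' . $from . '>',
    );
    return implode("\r\n", $headers);
}

// Placeholder address, for illustration only.
echo build_mail_headers('webmaster@example.com'), "\n";
```

The result can be passed as the fourth argument to mail(), in the same place as $naglowki in the function above.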
Remo replied (1 attachment):
The attached patch works for me.

I can now use German umlauts for the labels in the form, and I can submit data by mail too. Only tested with Outlook and the Gmail web client.

I'm not sure about htmlentities. This might cause trouble too...
katz515 replied (1 attachment):
It didn't work for Japanese.... I think I need to use

It does work for the email body, but not for the subject line...


But you gave me an idea... I'll follow up with you guys.
pdm replied:
The body of the email displays Polish non-ASCII chars, but the subject line does not :/
katz515 replied:
I didn't check your zip file carefully.

I'll give this a try as well.

Thanks!
Remo replied:
It shows you where the problem occurs, but it's not the real source of it.

What you probably have to check is this file:
/concrete/blocks/form/auto.js

The method addQuestion contains a few calls to "escape". This is a bit dangerous since it also escapes characters like the Japanese full-width characters (thanks Katz for the lesson :-)

I don't have a completely tested patch yet, but try experimenting with the "escape" calls (removing them, for example) and it should look better.
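
To show what this looks like from the PHP side, here is a small sketch. The two input strings are what JavaScript's escape() and encodeURIComponent() produce for the character "日"; they are hard-coded here as an assumption rather than taken from a real request. Suggesting encodeURIComponent() as the replacement is my own idea, not something from the patch.

```php
<?php
// What JavaScript sends for the character "日" (U+65E5):
//   escape("日")             gives "%u65E5"     (non-standard %uXXXX form)
//   encodeURIComponent("日") gives "%E6%97%A5"  (percent-encoded UTF-8 bytes)
$fromEscape = '%u65E5';
$fromEncodeURIComponent = '%E6%97%A5';

// rawurldecode() only understands %XX byte escapes, so the %uXXXX form
// passes through undecoded while the UTF-8 form decodes to the character.
var_dump(rawurldecode($fromEscape) === '%u65E5');
var_dump(rawurldecode($fromEncodeURIComponent) === "\xE6\x97\xA5"); // the bytes of "日"
```

So anything that went through escape() arrives on the server as literal "%u..." text unless some code decodes that form by hand.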
Remo replied:
I've talked to Katz for a while and tried to fix a few of his problems.

There are a few things I've learnt which I'd like to share:

1. Using the JavaScript method "escape" causes trouble since it escapes all the full-width characters too!

2. The "standard" string functions also cause a few problems. For example, in concrete/helpers/text.php, shortText contains a call to "substr". This function counts bytes, so it might cut a full-width character in the middle! Using mb_substr, however, is safe:

function shortText($textStr, $numChars=255, $tail='...'){
      if(intval($numChars)==0)$numChars=150;
      $textStr=strip_tags($textStr);
      // measure the length in characters, not bytes, to match mb_substr
      if (mb_strlen($textStr,'utf-8')>intval($numChars)){
         $textStr= mb_substr($textStr,0,$numChars,'utf-8').$tail;
      }
      return $textStr;
   }
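
A minimal sketch of the difference, assuming the mbstring extension is loaded. short_text_mb() is a hypothetical stand-alone variant written for this example, not the c5 helper; it measures the length in characters as well as cutting by characters.

```php
<?php
$text = '日本語'; // 3 characters, 9 bytes in UTF-8

// substr() counts bytes: 4 bytes is "日" plus the first byte of "本",
// which leaves an invalid UTF-8 sequence at the end of the string.
$cutBytes = substr($text, 0, 4);
var_dump(mb_check_encoding($cutBytes, 'UTF-8'));

// mb_substr() counts characters and never splits one.
$cutChars = mb_substr($text, 0, 2, 'UTF-8');
var_dump($cutChars === '日本');

// Hypothetical variant of shortText() that works in characters throughout.
function short_text_mb(string $text, int $numChars = 255, string $tail = '...'): string {
    $text = strip_tags($text);
    if (mb_strlen($text, 'UTF-8') > $numChars) {
        $text = mb_substr($text, 0, $numChars, 'UTF-8') . $tail;
    }
    return $text;
}

echo short_text_mb('<b>日本語テキスト</b>', 3), "\n";
```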
andrew replied:
Just wanted to let everyone know that we're taking this very seriously and have made a lot of fixes for these issues in svn, in the development/5.3.0 branch (although they may make it out before that).
matchy replied (1 attachment):
A patch for the current svn repository (rev. 693):

Content-Type: text -> text/plain
base64 encoding for the Subject
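
For readers who don't open the attachment, base64 encoding of a subject usually means the RFC 2047 "encoded-word" form. The sketch below shows that form only; it is not taken from the patch, and encode_subject_utf8() is a hypothetical name.

```php
<?php
// Sketch: wrap a UTF-8 subject as an RFC 2047 encoded-word using base64.
// Assumes a short subject: RFC 2047 limits each encoded-word to 75
// characters, and splitting long subjects is not handled here.
function encode_subject_utf8(string $subject): string {
    return '=?UTF-8?B?' . base64_encode($subject) . '?=';
}

echo encode_subject_utf8('Zażółć gęślą jaźń'), "\n";
```

The encoded string is what goes into the subject argument of mail(); the mail client decodes it back to the original text.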
andrew replied:
Added this patch to trunk.
matchy replied:
By the way, is the svn URI of C5 open to the public?
I found it by chance.
frz replied:
If you'd like to be on the beta team, you should let me know.

-frz
matchy replied:
What should I do?
matchy replied (1 attachment):
A Japanese <title> is garbled in IE.
'<meta http-equiv="content-type"' should come before '<title>'.

Attached is a patch for the current svn trunk (rev. 712).
matchy replied:
Should I have posted this to Beta Bugs?
If so, sorry.
andrew replied:
This should be fixed in subversion.
harunkaraman replied (1 attachment):
I had the same problem and fixed it.

Run this script against your database and you will see it fixed. The problem is most likely that your default charset was set to latin1 with the latin1_swedish_ci collation. Changing it to UTF-8 later does not help, because C5 has already created tens of tables which are not utf8, and you keep adding data to them. Sometimes hosting providers do not watch the default charsets during an upgrade.

After backing up your database, copy the PHP code below into a file. I also attached the file; rename its extension to .php to run it.

Edit the database settings.

Run it with your browser.

Done!

=== PHP CODE BEGINS ===

<?php
$host='localhost'; // database hostname; usually localhost
$user=''; // your MySQL user name
$pass=''; // your MySQL user password
$dbname=''; // your database name
$charset='utf8'; // the character set to convert to
$collation='utf8_general_ci'; // the collation to convert to

// mysqli is used here because the old mysql_* functions no longer exist in current PHP
$db = mysqli_connect($host, $user, $pass, $dbname) or die("Could not connect to the database; check the host, user name, password and database name. " . mysqli_connect_error());
$result = mysqli_query($db, 'SHOW TABLES') or die("Could not execute 'SHOW TABLES': " . mysqli_error($db));
// mysqli_fetch_row() returns each table name once, so every table is converted exactly once
while ($row = mysqli_fetch_row($result)) {
    $table = $row[0];
    mysqli_query($db, "ALTER TABLE `$table` CONVERT TO CHARACTER SET $charset COLLATE $collation") or die("Could not convert the table $table: " . mysqli_error($db));
}
mysqli_query($db, "ALTER DATABASE `$dbname` DEFAULT CHARACTER SET $charset COLLATE $collation") or die("Could not alter the collation of the database: " . mysqli_error($db));
echo "The collation of your database has been successfully changed!";
?>

=== PHP CODE END ===


You may contact us if you have any problem
http://www.kordil.com
fr0z3nk0 replied:
Nice one :)

Or just print out all the queries and run them yourself, because the script can time out :)

<?php
$host='localhost'; // database hostname; usually localhost
$user=''; // your MySQL user name
$pass=''; // your MySQL user password
$dbname=''; // your database name
$charset='utf8'; // the character set to convert to
$collation='utf8_general_ci'; // the collation to convert to

$db = mysqli_connect($host, $user, $pass, $dbname) or die("Could not connect to the database; check the host, user name, password and database name. " . mysqli_connect_error());
$result = mysqli_query($db, 'SHOW TABLES') or die("Could not execute 'SHOW TABLES': " . mysqli_error($db));
// Print the statements instead of running them, so they can be executed by hand
while ($row = mysqli_fetch_row($result)) {
    print "ALTER TABLE `$row[0]` CONVERT TO CHARACTER SET $charset COLLATE $collation;<br />";
}
print "ALTER DATABASE `$dbname` DEFAULT CHARACTER SET $charset COLLATE $collation;<br />";
?>