14 March 2009 at 8:42am
(Last edited: 15 March 2009 6:24am),
I just created my first SilverStripe website with my own template - and I like SilverStripe!
Now I'm trying to migrate the content from my old Joomla site. I exported the content to a CSV file and converted it to the structure of the SiteTree table. Then I imported the CSV into my SilverStripe database. All fine ... but now I am in trouble with the charset.
- CSV-file is UTF-8
- MySQL database is UTF-8
- _config.php includes: ContentNegotiator::set_encoding('utf-8');
- my page.ss includes <meta http-equiv="Content-type" content="text/html; charset=utf-8" >
- I even checked the database content by SELECT HEX(TITLE) FROM SiteTree
But the German umlaut characters (e.g. ä = U+00E4 = UTF-8 C3 A4) are still not displayed correctly, neither in the CMS nor in the frontend.
The Content (HTML field) is ok, but Title and MenuTitle are wrong.
Then I tried to enter an "ä" in the CMS, and I was very astonished that SilverStripe stored it as hex C3 83 C2 A4 in the database.
I don't think that this is UTF-8 !
Why doesn't Silverstripe use UTF-8 ? I'm very confused.
16 March 2009 at 6:17am
(Last edited: 18 March 2009 10:31am),
I made some experiments on three installations:
1. local wampp under Windows XP
2. ubuntu on a virtual system
3. debian server of my host-provider
In every case I created a page in the CMS with ä ö ü Ä Ö Ü characters in the title.
But there is no UTF-8 in any database. The umlaut characters are always coded in 4 bytes instead of 2 bytes. Is it UTF-32?
I cannot believe that SilverStripe uses UTF-8 at all.
It is no problem if you only create pages in the CMS. But I would like to know what charset SilverStripe really uses. It would make migrations easier.
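The UTF-32 guess can be checked quickly outside SilverStripe. The following is only a small Python sketch (the names `CHAR`, `hex_bytes` and `ENCODED` are made up for this illustration) that prints how "ä" looks in a few common encodings, so the result can be compared with what `HEX(Title)` returns.

```python
# "ä" is U+00E4; written as an escape so the file's own encoding does not matter.
CHAR = "\u00e4"


def hex_bytes(raw: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in raw)


# Big-endian variants are used so that no byte-order mark is prepended.
ENCODED = {
    name: hex_bytes(CHAR.encode(name))
    for name in ("latin-1", "utf-8", "utf-16-be", "utf-32-be")
}

for name, value in ENCODED.items():
    print(f"{name:10} {value}")
```

UTF-32 would give 00 00 00 E4 and UTF-8 would give C3 A4. Neither matches the C3 83 C2 A4 found in the database, so a four-byte result alone does not mean the column is UTF-32.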
20 March 2009 at 10:35am
Uhm, I really can't follow you. Just tried a run of the mill 2.3 installation with umlauts in Title, MenuTitle and Content (all entered through CMS), they all render fine in a standard blackcandy theme with content-type delivered as utf8.
Try it yourself on demo.silverstripe.com - if you're quick (next 30 min?), you can see my umlaut test page on there: http://demo.silverstripe.com/umlaut-test/.
This is how silverstripe sets the collation for SiteTree->Content: mediumtext character set utf8 collate utf8_general_ci
Here's the utf-8 setting which every site gets through the bootstrapper - admittedly it's harder *not* to use utf8 than it is to have it working out of the box ;) http://open.silverstripe.com/browser/modules/sapphire/branches/2.3/main.php#L61
20 March 2009 at 11:06pm
Thank you for your reply. I agree with you that there is no problem with umlaut characters if you enter them in the CMS. They are displayed very well both in the CMS and in the frontend.
But if you have a look at your MySQL database, you will see that the characters are not stored as UTF-8 in the SiteTree table.
E.g. run SELECT HEX(Title) FROM SiteTree WHERE ID = ...
At least on my three installations the umlauts were coded in 4 bytes, e.g. ä = C383C2A4. It isn't UTF-8!
Could you please try a SQL-Query on your demo-database to check it?
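To read the output of such a `HEX(Title)` query without guessing, the hex string can be decoded as UTF-8 and compared with the character that was typed. This is just a Python sketch for checking values by hand; `decode_hex_column` is a hypothetical helper, not something SilverStripe or MySQL provides.

```python
def decode_hex_column(hex_value: str) -> str:
    """Decode the result of a MySQL HEX(column) call as UTF-8 text."""
    raw = bytes.fromhex(hex_value)  # accepts "C3A4" as well as "C3 A4"
    return raw.decode("utf-8")


for stored in ("C3A4", "C383C2A4"):
    text = decode_hex_column(stored)
    # Print the code points so the result is unambiguous on any terminal.
    print(stored, [f"U+{ord(c):04X}" for c in text])
```

C3A4 decodes to the single character U+00E4 (ä). C383C2A4 is also valid UTF-8, but it decodes to two characters, U+00C3 and U+00A4 (Ã and ¤), which is not what was entered.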
20 March 2009 at 11:54pm
Hey Fanta, I've roughly traced through the execution of saving a dataobject through to the database column, and can't find anything suspicious that would deviate from our UTF-8 defaults. We need to get better about using multibyte-safe PHP5 functionality though, so I've opened a new ticket about this: http://open.silverstripe.com/ticket/3746 - not sure if it's related to your problem.
Any help in this area is appreciated (both in-depth testing and patches).
I found PHPWACT quite helpful to figure out UTF-8 in PHP: http://www.phpwact.org/php/i18n/utf-8
Unfortunately you can't expect PHP to do the "right thing" by default apparently... the list on this page is fairly frightening
Not sure why and how you got UTF-16 or even 32 characters in your db - can you give more details on your server setup, system locale, php locale, php.ini and mysql conf? :)
now I understand how the characters are coded:
E.g. the ä umlaut is Latin-1 (ISO-8859-1) hex E4 -> converted to UTF-8 hex C3 A4
If you interpret these two bytes as Latin-1 again, you can convert them a second time:
Latin-1 hex C3 (Ã) -> UTF-8 hex C3 83
Latin-1 hex A4 (¤) -> UTF-8 hex C2 A4
These are the 4 bytes (C3 83 C2 A4) I found in my database. SilverStripe makes the Latin-1 -> UTF-8 conversion two times!
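The double conversion can be reproduced in a few lines. This is a Python sketch of the effect only, assuming the intermediate misreading is plain Latin-1 (ISO-8859-1); it does not show where in the PHP/MySQL chain the second conversion happens, and `encode_twice` is a name invented for this example.

```python
def encode_twice(text: str) -> bytes:
    """Encode to UTF-8, misread those bytes as Latin-1, then encode to UTF-8 again."""
    once = text.encode("utf-8")        # correct UTF-8 bytes
    misread = once.decode("latin-1")   # each byte wrongly treated as one character
    return misread.encode("utf-8")     # second, unwanted conversion


stored = encode_twice("\u00e4")  # U+00E4 is "ä"
print(" ".join(f"{b:02X}" for b in stored))
```

For "ä" this yields C3 83 C2 A4, the same four bytes that `HEX(Title)` showed.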
Unfortunately I'm not strong enough in PHP programming, but maybe somebody can find the bug with this hint.
The main problem is that if you change the program, all existing SilverStripe installations will get problems with their content :-(
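Existing content could in principle be converted back by undoing one of the two conversions. The following Python sketch shows the idea on raw bytes; it assumes the misreading was plain Latin-1, and `repair_double_encoded` is a hypothetical helper, not an official migration tool. It should only be tried on a backup copy.

```python
def repair_double_encoded(raw: bytes) -> bytes:
    """Undo one Latin-1 -> UTF-8 round if the result is still valid UTF-8.

    Bytes that do not look double-encoded are returned unchanged.
    """
    try:
        text = raw.decode("utf-8")
        candidate = text.encode("latin-1")  # fails for characters above U+00FF
        candidate.decode("utf-8")           # fails if this is not valid UTF-8
    except (UnicodeDecodeError, UnicodeEncodeError):
        return raw
    return candidate


print(repair_double_encoded(bytes.fromhex("C383C2A4")).hex().upper())
print(repair_double_encoded(bytes.fromhex("C3A4")).hex().upper())
```

The check is a heuristic: a title that really is meant to contain the two characters "Ã¤" would also be changed, so the results need a manual review before they are written back.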