Migrating a Site to Silverstripe /

What you need to know when migrating your existing site to SilverStripe.

Moderators: martimiz, Sean, Ed, biapar, Willr, Ingo, swaiba

UTF-8 problem when migrating data


20 Posts   17727 Views

Fanta

Community Member, 7 Posts

14 March 2009 at 8:42am

Edited: 15/03/2009 6:24am

Hello,

I just created my first SilverStripe website with my own template, and I like SilverStripe!
Now I'm trying to migrate the content from my old Joomla site. I exported the content to a CSV file and converted it to the structure of the SiteTree table. Then I imported the CSV into my SilverStripe database. All fine ... but now I am in trouble with the charset.

- CSV-file is UTF-8
- MySQL database is UTF-8
- _config.php includes: ContentNegotiator::set_encoding('utf-8');
- my page.ss includes <meta http-equiv="Content-type" content="text/html; charset=utf-8" >
- I even checked the database content by SELECT HEX(TITLE) FROM SiteTree

But the German umlaut characters (e.g. ä = U+00E4 = UTF-8 C3 A4) are still not displayed correctly, neither in the CMS nor in the frontend.
The Content (HTML field) is OK, but Title and MenuTitle are wrong.

Then I tried entering an "ä" in the CMS, and I was astonished to see that SilverStripe stored it as hex C3 83 C2 A4 in the database.
I don't think that this is UTF-8!

Why doesn't SilverStripe use UTF-8? I'm very confused.

Thanks in advance
Fanta
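
For reference, the byte sequence C3 83 C2 A4 is what results when the UTF-8 bytes of "ä" are misread as Latin-1 characters and then encoded to UTF-8 a second time. The following is a minimal Python sketch of that effect, not SilverStripe code; where in the stack such a second conversion would happen is not established in this thread.

```python
# Illustration only: reproduce the bytes reported above for "ä".
char = "ä"  # U+00E4

# Correct UTF-8 encoding: two bytes.
utf8_once = char.encode("utf-8")

# Misread those two bytes as two Latin-1 characters ("Ã" and "¤"),
# then encode the result as UTF-8 again: four bytes.
utf8_twice = utf8_once.decode("latin-1").encode("utf-8")

print(utf8_once.hex().upper())   # C3A4
print(utf8_twice.hex().upper())  # C383C2A4
```

The second value matches what HEX(Title) reports for pages created in the CMS.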

Ingo

Forum Moderator, 801 Posts

15 March 2009 at 7:24am

Hm, we don't usually have problems with this; SilverStripe handles UTF-8 fine. What's the collation of your database fields?

Fanta

Community Member, 7 Posts

15 March 2009 at 11:16pm

Edited: 15/03/2009 11:34pm

The collation is "utf8_unicode_ci", and I also tried "utf8_general_ci".

Fanta

Community Member, 7 Posts

16 March 2009 at 6:17am

Edited: 18/03/2009 10:31am

I ran some experiments on three installations:
1. local wampp under Windows XP
2. Ubuntu on a virtual machine
3. Debian server of my hosting provider
In every case I created a page in the CMS with the characters ä ö ü Ä Ö Ü in the title.
But there is no UTF-8 in any of the databases. The umlaut characters are always encoded in 4 bytes instead of 2. Is it UTF-32?
I cannot believe that SilverStripe uses UTF-8 at all.

It is no problem if you only create pages in the CMS. But I would like to know which charset SilverStripe really uses. It would make migrations easier.
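
The UTF-32 question can be checked directly by encoding "ä" in each candidate encoding and comparing the result with the stored bytes. A small Python sketch, using the hex value C383C2A4 reported in this thread as the observed data:

```python
# Compare the observed bytes with what each Unicode encoding produces for "ä".
char = "ä"  # U+00E4
observed = "C383C2A4"  # value reported from HEX(Title) in this thread

encodings = {
    name: char.encode(name).hex().upper()
    for name in ("utf-8", "utf-16-be", "utf-32-be")
}

for name, hex_value in encodings.items():
    marker = "match" if hex_value == observed else "no match"
    print(f"{name:10} {hex_value:10} {marker}")
```

UTF-32 does use four bytes for "ä", but they would be 00 00 00 E4. The observed bytes match none of the three encodings, which suggests something other than a plain UTF-16 or UTF-32 column.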

Ingo

Forum Moderator, 801 Posts

20 March 2009 at 10:35am

Uhm, I really can't follow you. I just tried a run-of-the-mill 2.3 installation with umlauts in Title, MenuTitle and Content (all entered through the CMS); they all render fine in a standard blackcandy theme with the content type delivered as UTF-8.
Try it yourself on demo.silverstripe.com - if you're quick (next 30 min?), you can see my umlaut test page on there: http://demo.silverstripe.com/umlaut-test/.

This is how SilverStripe sets the collation for SiteTree->Content: mediumtext character set utf8 collate utf8_general_ci

Here's the UTF-8 setting which every site gets through the bootstrapper. Admittedly, it's harder *not* to use UTF-8 than it is to have it working out of the box ;) http://open.silverstripe.com/browser/modules/sapphire/branches/2.3/main.php#L61

Fanta

Community Member, 7 Posts

20 March 2009 at 11:06pm

Hello Ingo,

thank you for your reply. I agree with you that there is no problem with umlaut characters if you enter them in the CMS. They are displayed fine in both the CMS and the frontend.

But if you look at your MySQL database, you will see that the characters are not stored as UTF-8 in the SiteTree table.
E.g. run SELECT HEX(Title) FROM SiteTree WHERE ID = ...
At least on my three installations, the umlauts were encoded in 4 bytes, e.g. ä = C383C2A4. That isn't UTF-8!
Could you please run an SQL query on your demo database to check it?

Best regards,
Fanta
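
To interpret the output of such a HEX() query, the hex string can be decoded and tested for a second UTF-8 layer. This is a rough Python sketch; `describe_hex` is a hypothetical helper name, and the check assumes that any double encoding went through Latin-1.

```python
def describe_hex(hex_value: str) -> str:
    """Classify a hex dump (as returned by MySQL's HEX()) of a text column."""
    raw = bytes.fromhex(hex_value)
    # Raises UnicodeDecodeError if the bytes are not valid UTF-8 at all.
    text = raw.decode("utf-8")
    try:
        # If the decoded characters map back to bytes that are themselves
        # valid UTF-8, the value was probably encoded twice.
        inner = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return f"single-encoded UTF-8: {text!r}"
    if inner != text:
        return f"probably double-encoded: {text!r} -> {inner!r}"
    return f"plain ASCII: {text!r}"


print(describe_hex("C3A4"))
print(describe_hex("C383C2A4"))
```

The result is a heuristic: a correctly stored string that happens to consist of characters such as "Ã¤" would also be flagged.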

Ingo

Forum Moderator, 801 Posts

20 March 2009 at 11:54pm

Hey Fanta, I've roughly traced the execution of saving a DataObject through to the database column and can't find anything suspicious that would deviate from our UTF-8 defaults. We need to get better at using multibyte-safe PHP5 functionality though, so I've opened a new ticket about this: http://open.silverstripe.com/ticket/3746 - not sure if it's related to your problem.
Any help in this area is appreciated (both in-depth testing and patches).

I found PHPWACT quite helpful to figure out UTF-8 in PHP: http://www.phpwact.org/php/i18n/utf-8
Unfortunately you can't expect PHP to do the "right thing" by default, apparently... the list on that page is fairly frightening.

Not sure why and how you got UTF-16 or even 32 characters in your db - can you give more details on your server setup, system locale, php locale, php.ini and mysql conf? :)

Fanta

Community Member, 7 Posts

26 March 2009 at 9:56am

Hello Ingo,

now I understand how the characters are encoded:
e.g. the ä umlaut is Latin-1 (ISO-8859-1) hex E4 -> converted to UTF-8 hex C3 A4
If you interpret those two bytes as Latin-1 again, each of them gets converted a second time:
Latin-1 hex C3 (Ã) -> UTF-8 hex C3 83
Latin-1 hex A4 (¤) -> UTF-8 hex C2 A4

These are the 4 bytes (C3 83 C2 A4) I found in my database. SilverStripe does the Latin-1 -> UTF-8 conversion twice!

Unfortunately I'm not strong enough in PHP programming, but maybe somebody can find the bug with this hint.

The main problem is that if you change the program, all existing SilverStripe installations will get problems with their content :-(

Best regards,
Fanta
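
If existing content really is stored with this second encoding layer, the conversion can be reversed one value at a time. The following Python sketch assumes the extra layer went through strict Latin-1; `undo_double_encoding` is a hypothetical helper, not part of SilverStripe. If the misinterpretation went through Windows-1252 instead, some characters would need a different codec.

```python
def undo_double_encoding(text: str) -> str:
    """Strip one Latin-1/UTF-8 double-encoding layer, if one is present.

    Returns the input unchanged when it does not look double-encoded.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Characters outside Latin-1, or bytes that are not valid UTF-8:
        # treat the value as already correct.
        return text


# The four bytes from the database, read as UTF-8, give "Ã¤".
stored = bytes.fromhex("C383C2A4").decode("utf-8")
repaired = undo_double_encoding(stored)

print(repaired)                               # ä
print(repaired.encode("utf-8").hex().upper()) # C3A4
```

A correctly stored "ä" and plain ASCII text pass through unchanged, so the helper can be applied to a mix of migrated and CMS-entered values. Testing it on a copy of the data first is advisable.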
