Migrating a Site to Silverstripe

What you need to know when migrating your existing site to SilverStripe.

Moderators: martimiz, Sean, biapar, Willr, Ingo, swaiba, simon_w

UTF-8 problem when migrating data



20 Posts   11122 Views

Fanta

14 March 2009 at 8:42am (Last edited: 15 March 2009 6:24am), Community Member, 7 Posts

Hello,

I just created my first SilverStripe website with my own template, and I like SilverStripe!
Now I'm trying to migrate the content from my old Joomla site. I exported the content to a CSV file and converted it to the structure of the SiteTree table. Then I imported the CSV into my SilverStripe database. All fine ... but now I am in trouble with the charset.

- CSV-file is UTF-8
- MySQL database is UTF-8
- _config.php includes: ContentNegotiator::set_encoding('utf-8');
- my page.ss includes <meta http-equiv="Content-type" content="text/html; charset=utf-8" >
- I even checked the database content by SELECT HEX(TITLE) FROM SiteTree

But the German umlaut characters (e.g. ä = U+00E4 = UTF-8 C3 A4) are still not displayed correctly, neither in the CMS nor in the frontend.
The Content (HTML field) is OK, but Title and MenuTitle are wrong.

Then I tried to enter an "ä" in the CMS and I was very astonished that SilverStripe stored it as hex C3 83 C2 A4 in the database.
I don't think that this is UTF-8!

Why doesn't SilverStripe use UTF-8? I'm very confused.

Thanks in advance
Fanta
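
Editor's note: one way to read a HEX() dump like the ones quoted above is to decode the bytes as UTF-8 and look at how many characters come out. The following is only a minimal sketch; the helper name decode_hex_as_utf8 is made up for illustration and is not part of SilverStripe or MySQL.

```python
def decode_hex_as_utf8(hex_string: str) -> str:
    """Decode a hex dump (e.g. the output of SELECT HEX(Title)) as UTF-8 text."""
    raw = bytes.fromhex(hex_string.replace(" ", ""))
    return raw.decode("utf-8")


# The two byte sequences quoted in the post above.
expected = decode_hex_as_utf8("C3 A4")        # what a UTF-8 "ä" should look like
stored = decode_hex_as_utf8("C3 83 C2 A4")    # what was found in the database

# Print code points rather than glyphs so the result does not depend on the terminal.
print([f"U+{ord(ch):04X}" for ch in expected])
print([f"U+{ord(ch):04X}" for ch in stored])
```

The four stored bytes are themselves valid UTF-8, but they decode to two characters (U+00C3 and U+00A4, i.e. "Ã¤") instead of the single character U+00E4.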

Ingo

15 March 2009 at 7:24am Forum Moderator, 801 Posts

Hm, we don't usually have problems with this; SilverStripe handles UTF-8 fine. What's the collation of your database fields?

Fanta

15 March 2009 at 11:16pm (Last edited: 15 March 2009 11:34pm), Community Member, 7 Posts

The collation is "utf8_unicode_ci" and I also tried "utf8_general_ci".

Fanta

16 March 2009 at 6:17am (Last edited: 18 March 2009 10:31am), Community Member, 7 Posts

I made some experiments on three installations:
1. local wampp under Windows XP
2. ubuntu on a virtual system
3. debian server of my host-provider
In every case I created a page in the CMS with ä ö ü Ä Ö Ü characters in the title.
But there is no UTF-8 in any of the databases. The umlaut characters are always encoded in 4 bytes instead of 2 bytes. Is it UTF-32?
I cannot believe that SilverStripe uses UTF-8 at all.

It is no problem if you only create pages in the CMS. But I would like to know what charset SilverStripe really uses. It would make migrations easier.
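
Editor's note: the UTF-32 question can be checked by comparing the bytes each encoding would produce for ä. This is a minimal sketch; encoded_hex is a hypothetical helper name, and the big-endian codec variants are used so that no byte order mark is added to the output.

```python
def encoded_hex(char: str, encoding: str) -> str:
    """Return the bytes of `char` in the given encoding as an uppercase hex string."""
    return char.encode(encoding).hex().upper()


a_umlaut = "\u00e4"  # ä

for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
    print(encoding, encoded_hex(a_umlaut, encoding))
```

UTF-32 does use four bytes for ä, but they would be 00 00 00 E4, not C3 83 C2 A4, so the stored value is not UTF-32.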

Ingo

20 March 2009 at 10:35am Forum Moderator, 801 Posts

Uhm, I really can't follow you. Just tried a run of the mill 2.3 installation with umlauts in Title, MenuTitle and Content (all entered through CMS), they all render fine in a standard blackcandy theme with content-type delivered as utf8.
Try it yourself on demo.silverstripe.com - if you're quick (next 30 min?), you can see my umlaut test page on there: http://demo.silverstripe.com/umlaut-test/.

This is how silverstripe sets the collation for SiteTree->Content: mediumtext character set utf8 collate utf8_general_ci

Here's the UTF-8 setting which every site gets through the bootstrapper. Admittedly it's harder *not* to use UTF-8 than it is to have it working out of the box ;) http://open.silverstripe.com/browser/modules/sapphire/branches/2.3/main.php#L61

Fanta

20 March 2009 at 11:06pm Community Member, 7 Posts

Hello Ingo,

thank you for your reply. I agree with you that there is no problem with umlaut characters if you enter them in the CMS. They are displayed correctly in both the CMS and the frontend.

If you have a look at your MySQL database, you will see that the characters are not stored as UTF-8 in the SiteTree table.
E.g. run SELECT HEX(Title) FROM SiteTree WHERE ID = ...
At least on my three installations the umlauts were encoded in 4 bytes, e.g. ä = C383C2A4. That isn't UTF-8!
Could you please try a SQL-Query on your demo-database to check it?

Best regards,
Fanta

Ingo

20 March 2009 at 11:54pm Forum Moderator, 801 Posts

Hey Fanta, I've roughly traced through the execution of saving a DataObject through to the database column, and can't find anything suspicious that would deviate from our UTF-8 defaults. We need to get better about using multibyte-safe PHP5 functionality though; I've opened a new ticket about this: http://open.silverstripe.com/ticket/3746 - not sure if it's related to your problem.
Any help in this area is appreciated (both in-depth testing and patches).

I found PHPWACT quite helpful to figure out UTF-8 in PHP: http://www.phpwact.org/php/i18n/utf-8
Unfortunately you can't expect PHP to do the "right thing" by default apparently... the list on this page is fairly frightening

Not sure why and how you got UTF-16 or even 32 characters in your db - can you give more details on your server setup, system locale, php locale, php.ini and mysql conf? :)

Fanta

26 March 2009 at 9:56am Community Member, 7 Posts

Hello Ingo,

now I understand how the characters are encoded:
e.g. the ä umlaut is Latin-1 (ISO-8859-1) hex E4 -> converted to UTF-8 hex C3 A4
If you interpret these two bytes as Latin-1 again, you can convert them a second time:
Latin-1 hex C3 (Ã) -> UTF-8 hex C3 83
Latin-1 hex A4 (¤) -> UTF-8 hex C2 A4

These are the 4 bytes (C3 83 C2 A4) I found in my database. SilverStripe does the Latin-1 -> UTF-8 conversion twice!
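
Editor's note: the double conversion described above can be reproduced in a few lines. This is a sketch only; the function name double_encode is hypothetical, and treating the misread charset as Latin-1 is an assumption (the thread does not establish where in the stack the second conversion happens).

```python
def double_encode(text: str, wrong_charset: str = "latin-1") -> bytes:
    """Encode text as UTF-8, misread the bytes as `wrong_charset`, and encode again."""
    once = text.encode("utf-8")            # correct UTF-8 bytes, e.g. C3 A4 for ä
    misread = once.decode(wrong_charset)   # each byte is mistaken for one character
    return misread.encode("utf-8")         # those characters are encoded a second time


print(double_encode("\u00e4").hex().upper())  # \u00e4 is ä
```

For ä this yields C383C2A4, the same four bytes reported from SELECT HEX(...) earlier in the thread.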

Unfortunately I'm not strong enough in PHP programming, but maybe somebody can find the bug with this hint.

The main problem is that if you change the program, all existing SilverStripe installations will get problems with their content :-(

Best regards,
Fanta
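
Editor's note: if existing content is double-encoded as described, the conversion can in principle be reversed once the text has been read back as UTF-8. The sketch below is an illustration under assumptions, not a SilverStripe migration tool: undo_double_encoding is a hypothetical name, and Latin-1 is assumed as the misread charset (some setups may need "cp1252" instead).

```python
def undo_double_encoding(text: str, wrong_charset: str = "latin-1") -> str:
    """Reverse one round of double encoding; return the input unchanged if it does not apply."""
    try:
        # Map each character back to the single byte it was misread from,
        # then decode those bytes as the UTF-8 they originally were.
        return text.encode(wrong_charset).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in the wrong charset, or not valid UTF-8 afterwards:
        # treat the text as already correct.
        return text


print(undo_double_encoding("\u00c3\u00a4") == "\u00e4")  # "Ã¤" is repaired to "ä"
print(undo_double_encoding("\u00e4") == "\u00e4")        # a correct "ä" is left alone
```

A string that legitimately contains a sequence such as "Ã¤" would also be converted, so a check like this is best run on a copy of the data and reviewed before anything is written back.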
