Lucene module - search including PDF, Word and Excel


schellmax

Community Member, 126 Posts

8 March 2011 at 4:26am

hey, just found indexing relations is now possible. http://code.google.com/p/lucene-silverstripe-plugin/source/detail?r=29
i'll definitely give this a try in my next silverstripe project.
thanks for all the work!

Darren Inwood

Community Member, 12 Posts

8 March 2011 at 9:02am

Cheers for your interest!

I've just released a new version incorporating some ideas from users. It indexes files a LOT quicker, and can index older Word/Excel documents too. If you're playing with it at the moment, I'd suggest grabbing the 0.3.3 release. =]

hippo1

Community Member, 5 Posts

22 June 2011 at 12:46pm

Darren - does version 0.3.3 work with older Microsoft Word/Excel documents (i.e. the Office 97-2003 .doc and .xls formats)?
My install of SilverStripe (2.4.4) returns search results from .docx and .xlsx files but not from .doc and .xls files.

Darren Inwood

Community Member, 12 Posts

22 June 2011 at 1:29pm

Hi hippo1,

You actually need the 'zip' extension for PHP loaded for scanning xlsx/docx/pptx documents to work. To check if you have this extension, visit a phpinfo() page and see if there is a section under 'Configuration' for 'zip'.
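As a quicker alternative to reading through the phpinfo() output, the same check can be sketched in a couple of lines. The function name zipSupportAvailable is made up for illustration; extension_loaded() and the ZipArchive class are standard PHP. Run it through the web server rather than the command line, since the two can load different php.ini files.

```php
<?php
// Hypothetical helper (not part of the Lucene module): reports whether the
// 'zip' extension is available to the PHP runtime executing this script.
function zipSupportAvailable()
{
    // extension_loaded() checks the loaded extension list; ZipArchive is the
    // class that the zip extension provides.
    return extension_loaded('zip') && class_exists('ZipArchive');
}

echo zipSupportAvailable()
    ? "zip extension loaded\n"
    : "zip extension missing\n";
```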

If you're on a Debian-based host (many hosting vendors use Debian), the zip extension is usually available already. If you're on WAMP/MAMP, you may need to go to 'PHP Extensions' and enable the php_zip option.

Hope this helps! :-)
--
Darren

hippo1

Community Member, 5 Posts

22 June 2011 at 1:54pm

Thanks Darren, but as explained, I can actually scan the documents you mentioned (.docx, .xlsx); it just won't scan .doc or .xls documents.
So it seems to me (being rather new to all of this, I can certainly be wrong) that the Lucene 0.3.3 module only searches the newer Office 2007 formats (.docx and .xlsx) but not the older Office 97-2003 formats (.doc and .xls). I was looking at how hard it would be to modify the code to include .doc and .xls documents, but thought I would ask here first whether they should be supported and whether my SilverStripe installation might be at fault.
BTW - I do have the PHP Zip module installed (verified by phpinfo), and am running SS 2.4.4 on a Windows server with SQL2008 as the backend.

Darren Inwood

Community Member, 12 Posts

22 June 2011 at 2:19pm

Apologies, I'm at work atm so my mind is elsewhere...!

You need to install the command-line 'catdoc' utility suite to enable scanning doc/xls/ppt documents. This isn't in the documentation, which is a bit of an oversight!

If you're on debian/ubuntu you can do apt-get install catdoc, if you're on another *nix or Mac OS X you can use your own package management system or will possibly need to compile them from source:
http://www.wagner.pp.ru/~vitus/software/catdoc/

If you are using Windows then sorry, but you'll need to do some coding :-( In the ZendSearchLuceneWrapper::index() function there are some lines like this that won't work on Windows:
$catdoc = trim(shell_exec('which catdoc'));

You'll need to replace them with either hardcoded file paths to wherever the catdoc utilities are installed, or set up some sort of config system yourself.
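A rough sketch of what such a replacement could look like, not the module's actual code: the helper name findCatdocBinary and its $configuredPath parameter are invented for illustration, and it assumes the Windows `where` command is available on the server. It prefers an explicitly configured path and only falls back to asking the operating system.

```php
<?php
// Hypothetical helper (not part of the Lucene module): resolves the path to
// one of the catdoc utilities without relying on the *nix-only `which`.
function findCatdocBinary($name, $configuredPath = null)
{
    // 1. An explicitly configured path wins if it points at a real file.
    if ($configuredPath !== null && is_file($configuredPath)) {
        return $configuredPath;
    }

    // 2. Otherwise ask the OS: `where` on Windows, `which` elsewhere.
    $isWindows = strtoupper(substr(PHP_OS, 0, 3)) === 'WIN';
    $lookup    = $isWindows ? 'where' : 'which';
    $devNull   = $isWindows ? '2>NUL' : '2>/dev/null';
    $output    = shell_exec($lookup . ' ' . escapeshellarg($name) . ' ' . $devNull);

    // shell_exec() gives a non-string result when the command fails or
    // prints nothing, i.e. when the utility was not found.
    if (!is_string($output)) {
        return null;
    }

    // `where` can print several matches, one per line; keep the first.
    $lines = preg_split('/\r\n|\r|\n/', trim($output));
    $first = trim($lines[0]);

    return $first !== '' ? $first : null;
}
```

The existing line would then become something like `$catdoc = findCatdocBinary('catdoc', 'C:\\tools\\catdoc\\catdoc.exe');`, where the Windows path is only a placeholder for wherever the utilities are actually installed.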

Hopefully I can find time to get this going on Windows soon so I can release the current SVN trunk, where all this is fixed already!

Hope that's a better answer =]

hippo1

Community Member, 5 Posts

22 June 2011 at 6:51pm

Many thanks for the quick response!
I might have a crack at getting this to work on Windows through some coding, as suggested, if only to become a little better acquainted with SilverStripe :)

Cheers.

nicolant

Community Member, 6 Posts

25 June 2011 at 5:24am

First, thanks for a great module.

I'm testing it on shared hosting, which has a limit of 64 open files. As a result, every time I rebuild the index, or try to index a page after that, I get a "Too many files open" error. Is there a way to reduce the number of open files?