Lucene module - search including PDF, Word and Excel

  • Anonymous user
    Community Member
    1 Post

    Hi all,

    I've implemented a new Search module for SilverStripe, hope it helps someone out!
    It uses Zend_Search_Lucene, which gives more relevant results than the standard MySQL stuff, and supports indexing Microsoft Word and Excel documents.

    I've used the StandardAnalyzer component by Kenny Katzgrau to tokenise/stem English words.

    It can also index PDFs, using the pdftotext command-line utility if it is installed and falling back to the PDF2Text class by Joeri Stegeman if it is not.
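
    For anyone curious what that kind of fallback looks like, here is a minimal sketch in Python. It is not the module's code (the module is PHP), and the names extract_pdf_text, fallback and which are made up for illustration. It assumes only that pdftotext writes to stdout when the output file is given as "-".

```python
import shutil
import subprocess
from typing import Callable, Optional


def extract_pdf_text(
    path: str,
    fallback: Callable[[str], str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Return the text of a PDF, preferring pdftotext when it is on PATH.

    `fallback` is a hypothetical stand-in for a pure-code extractor
    (the role PDF2Text plays in the module).
    """
    binary = which("pdftotext")
    if binary is None:
        # pdftotext is not installed: use the pure-code extractor instead.
        return fallback(path)
    # "-" as the output file tells pdftotext to write the text to stdout.
    result = subprocess.run([binary, path, "-"], capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

    The which parameter is only there so the lookup can be swapped out, for example to force the fallback branch on a machine that does have pdftotext.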

    http://code.google.com/p/lucene-silverstripe-plugin/

    Please leave your comments for improvement.

    Cheers!

  • Howard
    Forum Moderator
    215 Posts

    Wow, this looks impressive. I look forward to trying it out in my next project.

  • Willr
    Forum Moderator
    5175 Posts

    Make sure you have submitted it to the modules section here on silverstripe.org so that people know it's around!

  • Mad_Clog
    Community Member
    78 Posts

    So this doesn't have any server dependencies like the Sphinx module?
    From the looks of it, it's really easy to set up, use and extend.
    I think the dev team should also look into this, as the current search mechanism is a sodding nightmare!

  • Anonymous user
    Community Member
    1 Post

    It doesn't depend on anything - everything you need to start indexing is in the module folder. =]

    If you have 'pdftotext' installed on your server, it will take advantage of that for better PDF indexing, though. One caveat: if you already have Zend in your include_path, it may interfere with the copy of Zend that's included with the module.

  • schellmax
    Community Member
    126 Posts

    This looks really promising.
    I agree with Mad_Clog that the search functionality that ships with SilverStripe is very restricted; also, I didn't find the Sphinx module overly useful because of the server dependencies mentioned above.

    However, the thing that was bugging me most about SilverStripe's search method is that it won't look into related fields. From what I've seen, this is a feature you've got on your list for future enhancements. I would be really happy to hear about it if you manage to integrate this feature. Thumbs up!

  • Mad_Clog
    Community Member
    78 Posts

    Integrated this into a site last night, had it up and running within 30 minutes.
    It wasn't playing nice with the ZendFramework reference which was already in my include path.
    In the set_include_path() call in _config.php, I stripped my original ZF path out of the get_include_path() value that is used there.
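
    The filtering step itself is simple enough to sketch. The version below is Python rather than PHP and is only an illustration: strip_include_path and the example paths are invented here, not taken from the module. In _config.php the input string would come from get_include_path() and the result would be handed to set_include_path().

```python
import os


def strip_include_path(include_path: str, unwanted: str, sep: str = os.pathsep) -> str:
    """Drop every entry equal to `unwanted` from a separator-joined path list."""
    # Compare without trailing slashes so "/lib/zf/" and "/lib/zf" match.
    target = unwanted.rstrip("/")
    kept = [entry for entry in include_path.split(sep) if entry.rstrip("/") != target]
    return sep.join(kept)


# Hypothetical include path with a system-wide Zend Framework entry in it.
original = ".:/usr/share/php/ZendFramework/library:/usr/share/php"
cleaned = strip_include_path(original, "/usr/share/php/ZendFramework/library", sep=":")
print(cleaned)  # .:/usr/share/php
```

    Entries are compared as whole paths, so a directory that merely contains the unwanted path as a prefix is left alone.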

    Thanks for this module, and keep it up!

  • Anonymous user
    Community Member
    1 Post

    I really should do some work so that it will use your own Zend if you've included it already... paid work is coming first at the moment!

    I haven't decided on a way to index relationships just yet. For configuration, I think using dot-notation like other SS relation configs would be a good idea, e.g. Image.Title for a has-one, or Images.Title for a has-many or many-many.
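
    To make the dot-notation idea concrete, here is a rough sketch of how such a path could be resolved. It is Python with plain dicts standing in for records, not SilverStripe code, and resolve_relation and the sample page are hypothetical. A single related record (has-one) is a dict and a has-many or many-many relation is a list of dicts.

```python
from typing import Any, List


def resolve_relation(record: Any, path: str) -> List[Any]:
    """Follow a dot-notation path and collect every value it reaches."""
    current = [record]
    for name in path.split("."):
        next_level = []
        for item in current:
            value = item[name]
            # A has-many / many-many relation yields a list; flatten it
            # so the next path segment is applied to each related record.
            if isinstance(value, list):
                next_level.extend(value)
            else:
                next_level.append(value)
        current = next_level
    return current


# Hypothetical page with one has-one relation and one has-many relation.
page = {
    "Title": "Home",
    "Image": {"Title": "Logo"},
    "Images": [{"Title": "A"}, {"Title": "B"}],
}
print(resolve_relation(page, "Image.Title"))   # ['Logo']
print(resolve_relation(page, "Images.Title"))  # ['A', 'B']
```

    Returning a list in both cases means the indexer could treat has-one and has-many fields the same way, for instance by joining the values into one searchable string.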

    Adding OCR for indexing text contained in images is also on the cards.
