
Solr search and indexing files


tgfisher

Community Member, 10 Posts

13 May 2011 at 4:26am

Edited: 13/05/2011 4:29am

We're running SilverStripe on Windows and we've got a lot of PDFs, Word documents, PowerPoint presentations, etc. that we NEED to be searchable via SilverStripe. We're looking at the SilverStripe Solr module, but it doesn't appear to support indexing these files yet (at least from what I can tell from the module's issue page: https://github.com/nyeholt/silverstripe-solr/issues/1).

I know Solr can handle these files out of the box. What needs to be done on the module side to make it work? I would be interested in helping build this out (as it is a prerequisite for the search functionality of our site). Any feedback would be greatly appreciated ::cough:: Marcus? ::cough::

:)

Marcus

Administrator, 87 Posts

16 May 2011 at 1:46pm

Howdy - there are a couple of ways of approaching it. One is to convert files to a text representation and send that through to be indexed. The other option is to use Solr's native file indexing mechanisms, but a) I haven't done any research into it to know how well it works and b) I'm not sure if the PHP library I'm using supports file attachments to documents. Unfortunately I don't have any time at the moment to look into things - and probably won't for a month or two at least! If you're able to do any investigation, I'm happy to answer questions along the way.

tgfisher

Community Member, 10 Posts

18 May 2011 at 1:32am

Marcus, thanks for the response.

Which PHP library are you referring to? I haven't dug into the code yet, but if you give me the name, I'll keep an eye out for it.

We used Plone as our previous CMS, and its approach was similar to your first suggestion. It used "wv", "pdftotext", and some other utilities to extract text from the files, and then indexed from there. It seemed to work well, but I'm not sure what search platform Plone uses.
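The extract-then-index approach described above could be sketched roughly as follows. This is only an illustration: it assumes a pdftotext binary (from Xpdf or Poppler) is available on the PATH, the two function names and the file path are made up for the example, and the step that hands the text to the Solr module is left out because the thread does not show that API.

```php
<?php
// Hypothetical helper: builds the command line that asks pdftotext to write
// the extracted text to standard output ("-" in place of an output file).
function build_pdftotext_command($pdfPath) {
    return 'pdftotext ' . escapeshellarg($pdfPath) . ' -';
}

// Hypothetical helper: returns the extracted text, or null if nothing could
// be extracted (missing file, pdftotext not installed, empty output).
function extract_pdf_text($pdfPath) {
    if (!is_file($pdfPath)) {
        return null;
    }
    $output = shell_exec(build_pdftotext_command($pdfPath));
    return is_string($output) ? trim($output) : null;
}

$text = extract_pdf_text('assets/example.pdf'); // hypothetical path
echo $text === null
    ? "no text extracted\n"
    : strlen($text) . " characters extracted\n";
```

Word and PowerPoint files would need their own converters, which is part of the argument for letting Solr do the parsing instead.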

Is Apache Tika the native file indexing mechanism you are talking about? I saw it is built into Solr specifically for "Rich Document Parsing and Indexing (PDF, Word, HTML, etc)".
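For reference, Solr exposes Tika through its ExtractingRequestHandler (Solr Cell), normally mapped to /update/extract in solrconfig.xml. A file is POSTed to that URL along with literal.* parameters that set field values on the resulting document. The sketch below only builds such a URL; the base URL, the document ID and the function name are placeholders, and whether the Solr PHP client can send the file itself is still an open question in this thread.

```php
<?php
// Hypothetical helper: builds the URL for Solr's ExtractingRequestHandler.
// $solrBase and $docId are placeholders; adjust them to your installation.
function build_extract_url($solrBase, $docId, $commit = true) {
    $params = array(
        'literal.id' => $docId,                     // unique key for the document
        'commit'     => $commit ? 'true' : 'false', // commit straight after indexing
    );
    return rtrim($solrBase, '/') . '/update/extract?'
        . http_build_query($params, '', '&');
}

echo build_extract_url('http://localhost:8983/solr', 'File_42'), "\n";
// prints http://localhost:8983/solr/update/extract?literal.id=File_42&commit=true
```

The file would then be sent as a multipart POST to that URL, for example with curl's -F "myfile=@document.pdf" option.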

It may be better to spend some time working with Solr's built-in support than to reinvent the wheel with other utilities. I'll dig in and see what I can do on my own. I'm sure I'll be tapping you for help as I go along.

Marcus

Administrator, 87 Posts

18 May 2011 at 10:46am

I was actually referring to the Solr PHP client library :) http://code.google.com/p/solr-php-client/

tgfisher

Community Member, 10 Posts

26 May 2011 at 7:16am

Marcus, I got the module up and running on Windows, but I'm having a hard time getting Solr to index (and return results for) additional search fields. Here's what I've got in _config.php for my site:

SiteTree::add_extension('SiteTree', 'MySiteTree');
DataObject::add_extension('SiteTree', 'SolrIndexable');
Object::add_extension('Page_Controller', 'SolrSearchExtension');
DataObject::add_extension('DataObject', 'MyDataObject');

And here's what I have for the "MySiteTree" decorator:

class MySiteTree extends SiteTreeDecorator {
   
   function extraStatics($class = null) {
      return array(
         'searchable_fields' => array(
            'Title',
            'Content',
            'MetaDescription',
            'MetaKeywords',
         )
      );
   }
   ...

I'm fairly new to SilverStripe, so I feel like I may be missing something obvious. Any ideas on how I can get MetaDescription and MetaKeywords integrated into search?

Marcus

Administrator, 87 Posts

27 May 2011 at 6:47pm

I've done a quick update to the module that should help you out, and have updated the doco at https://github.com/nyeholt/silverstripe-solr/wiki/Usage-overview. Basically, you'll need to add the searchable_fields changes via updateSearchableFields in your extension (or just modify the static variable directly). You'll also need to register the fields as default query fields via SolrSearchService::add_default_query_field('MetaDescription_t'); and SolrSearchService::add_default_query_field('MetaKeywords_ms');

Be aware of the note about the difference between _t and _ms fields; one is tokenised first and the other isn't, meaning that for the untokenised one you need to explicitly add * to your search term to get partial matches.
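Put together, the _config.php side of this might look like the sketch below. The two add_default_query_field calls are the ones quoted above; prefix_query is a made-up helper, not part of the module, and it assumes _ms is the untokenised field (as the suffix suggests) and that the search term is a single word with no special query characters.

```php
<?php
// Hypothetical helper, not part of the Solr module: builds a prefix query
// against an untokenised field by appending a trailing wildcard.
function prefix_query($field, $term) {
    $term = rtrim(trim($term), '*'); // avoid doubling up an existing wildcard
    return $field . ':' . $term . '*';
}

// The calls from the post above, guarded so that this file also runs
// where the module is not installed.
if (class_exists('SolrSearchService')) {
    SolrSearchService::add_default_query_field('MetaDescription_t');
    SolrSearchService::add_default_query_field('MetaKeywords_ms');
}

echo prefix_query('MetaKeywords_ms', 'silver'), "\n";
// prints MetaKeywords_ms:silver*
```

Without the trailing *, a query for "silver" would only match a MetaKeywords_ms value that is exactly "silver", whereas the tokenised MetaDescription_t field would match the word anywhere in the text.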

BlueO

Community Member, 52 Posts

22 March 2012 at 5:47pm

Hey, I'm also running into a few issues with this and was hoping to get a pointer or two.

I've got the module set up and it is indexing my pages just fine, but it won't index a DataObject called 'Entry'.
I've put in

DataObject::add_extension('Entry', 'SolrIndexable');

but no joy,

ideas?

cheers

Bernard

Marcus

Administrator, 87 Posts

22 March 2012 at 6:09pm

Are you getting any errors spat out at all? I've got a project where I'm indexing standalone non-SiteTree data objects and it works fine.
