

Search - what should I actually expect?




6 Posts   1967 Views

Junglefish

Community Member, 109 Posts

18 August 2009 at 1:19am

Hi

Am currently evaluating an out-of-the-box (more-or-less) installation of SS, version 2.3.3. The search functionality is returning some rather odd results so I wanted to find out a) what I should expect out of the box and b) what can be achieved by tweaking if necessary.

(Apologies for the slight terseness in the wording below, but I was trying to be as concise as possible while stating what I'm trying to achieve.)

Here's what's currently happening:

PageTypes:
Page
Calendar
CalendarEvent
ArticleHolder
ArticlePage
...all return results. However, any COMMENTS in any of those pagetypes do not get returned.

Question: Is this what I should expect?

Question: How can I get the comments included in the search results?

Binary documents:
PDFs
When the PDF is uploaded, it is not immediately picked up by search.
When I use a re-director pagetype to link directly to a PDF, the PDF text is found in the search results.
When I use a hyperlink in the text of a normal page to simply point at a PDF, the text is NOT found in the search results.

Question: Is this what I should expect?

DOCs
Random Word Document a)
As soon as it was uploaded, it was picked up by search, even before adding it to any content.

Random Word Document b)
Is NOT picked up by search, even after using a re-director pagetype to link directly to it.

This seems very odd. Why would the two Word docs be handled so differently?

Question: At what point should DOCs and PDFs be included in search results? Just after upload? After they've been pulled up into the site in some way? If so, in what way?

Question: Is there some process I need to undertake in order to add binary documents to the search index? If not, what possible reasons might there be to explain why one word document is being found while the other one isn't?

All advice gratefully received...

Junglefish

Community Member, 109 Posts

18 August 2009 at 8:31pm

More info...

It appears I am mistaken on a couple of things. My 'inconsistencies' were caused by me searching for a term that was present in the actual filename. So, here's what I can deduce...

With binary documents, the search functionality returns a result if it finds the search term in the filename, but not in the document itself. This appears to be true for both PDFs and DOCs.

My questions are therefore now much simpler:

a) How do I enable full-text search on PDFs and DOCs?
b) How do I enable search on Comments?

Thanks,

dalesaurus

Community Member, 283 Posts

19 August 2009 at 4:13am

Edited: 19/08/2009 4:19am

I am not very familiar with SS's searching capabilities but I am certain they are based on MySQL's full text indexing. This will generally not allow you to do what you are asking above with binary types. Well, at least not without some hacking.

If you are trying to search rich document types I would recommend using a software package that is designed to do just that instead of a CMS.

Apache Solr would be a good standalone product that you can integrate into your SS pages via built-in APIs:
http://lucene.apache.org/solr/

It sounds like you might really be looking for a Document Management system. KnowledgeTree excels in this area:
http://www.knowledgetree.com/opensource

And hey, if you have the cash go for a Google Search Appliance. For ~$3000 it will do everything...EVERYTHING.
http://www.google.com/gsa

Junglefish

Community Member, 109 Posts

19 August 2009 at 7:13pm

@Dalesaurus. Thanks for the info.

It's funny, when I was looking around at CMSs to evaluate, one of the killer features I thought SS had was the ability to search binary documents. I *swear* I read it in the documentation somewhere. Also, the guy in this thread seems to suggest it's possible http://ssorg.bigbird.silverstripe.com/archive/show/127856#post127856

But I take your point, and I guess I'm resigned to having to think about integrating a third-party solution.

Does anyone out there have any experience of integrating Lucene with SS and can offer any pointers or tips?

(PS. I'm not really doing "document management", but the application I need to build is an intranet where they make lots of core information available in PDF form. Therefore, the ability to search those PDFs is a key requirement.)

dalesaurus

Community Member, 283 Posts

20 August 2009 at 2:54am

A super hacky solution would be to add a form for PDF search and have it exec a grep command from PHP. There are CLI tools that can read PDFs, grep can return a filename, and you can use DataObject::get* to grab the objects by name and pack them in a DataObjectSet for pretty results.

http://stackoverflow.com/questions/694049/unable-to-search-pdf-files-contents-in-terminal

It's not ideal, but it's a pretty minimal amount of coding.
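
To make the idea concrete, here is a rough sketch of the "extract text, then match" step. It is written in Python rather than PHP purely for illustration, and it assumes the `pdftotext` CLI tool (from poppler-utils) is installed on the server. The function names `pdftotext_extract` and `find_matching_files` are made up for this example and are not part of SilverStripe.

```python
import subprocess
from pathlib import Path
from typing import Callable, Iterable, List


def pdftotext_extract(pdf_path: Path) -> str:
    """Return the text of a PDF using the pdftotext CLI.

    Assumption: pdftotext (poppler-utils) is on the PATH.
    Passing "-" as the output file makes it write to stdout.
    """
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Treat files that cannot be converted as having no text.
    return result.stdout if result.returncode == 0 else ""


def find_matching_files(
    paths: Iterable[Path],
    term: str,
    extract: Callable[[Path], str] = pdftotext_extract,
) -> List[str]:
    """Return the filenames whose extracted text contains the search term.

    The match is a plain case-insensitive substring test, which is
    roughly what a grep -i -l over the extracted text would give you.
    """
    needle = term.lower()
    return [p.name for p in paths if needle in extract(p).lower()]
```

Something like `find_matching_files(Path("assets").glob("*.pdf"), "holiday")` would then give you the list of filenames to look up on the SilverStripe side (the `assets` path here is just a guess at where your uploads live). Note that this re-reads every PDF on every search, so it will only be tolerable for a small number of files.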

However, if you are looking for an enterprise intranet that will have non-tech folks working with it, I would still recommend evaluating KnowledgeTree or Foswiki.

Junglefish

Community Member, 109 Posts

20 August 2009 at 7:35pm

@Dalesaurus. Thanks for your suggestions.

Foswiki looks interesting, and might actually be suitable for another project I'm meant to be looking at, but I REALLY want to use SS for this intranet project. So, I need to find a way of searching the binaries within SS and integrating them with the database search results.

I think I'll open a new thread with this requirement stated clearly. Someone out there must have had to nail this puppy before...

j/