18 August 2009 at 1:19am
Am currently evaluating an out-of-the-box (more-or-less) installation of SS, version 2.3.3. The search functionality is returning some rather odd results so I wanted to find out a) what I should expect out of the box and b) what can be achieved by tweaking if necessary.
(Apologies for the slight terseness in the wording below, but I was trying to be as concise as possible while stating what I'm trying to achieve.)
Here's what's currently happening:
...all return results. However, any COMMENTS in any of those pagetypes do not get returned.
Question: Is this what I should expect?
Question: How can I get the comments included in the search results?
When the PDF is uploaded, it is not immediately picked up by search.
When I use a re-director pagetype to link directly to a PDF, the PDF text is found in the search results.
When I use a hyperlink in the text of a normal page to simply point at a PDF, the text is NOT found in the search results.
Question: Is this what I should expect?
Random Word Document a)
As soon as it was uploaded, it was picked up by search, even before I added it to any content.
Random Word Document b)
Is NOT picked up by search, even after using a re-director pagetype to link directly to it.
This seems very odd. Why would the two Word docs be handled so differently?
Question: At what point should DOCs and PDFs be included in search results? Just after upload? After they've been pulled up into the site in some way? If so, in what way?
Question: Is there some process I need to undertake in order to add binary documents to the search index? If not, what possible reasons might there be to explain why one word document is being found while the other one isn't?
All advice gratefully received...
18 August 2009 at 8:31pm
It appears I am mistaken on a couple of things. My 'inconsistencies' were caused by me searching for a term that was present in the actual filename. So, here's what I can deduce...
With binary documents, the search functionality returns a result if it finds the search term in the filename, but not in the document itself. This appears to be true for both PDFs and DOCs.
My questions are therefore now much simpler:
a) How do I enable full-text search on PDFs and DOCs?
b) How do I enable search on Comments?
19 August 2009 at 4:13am Last edited: 19 August 2009 4:19am
I am not very familiar with SS's searching capabilities, but I am certain they are based on MySQL's full-text indexing. This will generally not allow you to do what you are asking above with binary types. Well, at least not without some hacking.
If you are trying to search rich document types I would recommend using a software package that is designed to do just that instead of a CMS.
Apache Solr would be a good standalone product that you can integrate into your SS pages via built-in APIs.
It sounds like you might really be looking for a Document Management system. KnowledgeTree excels in this area:
And hey, if you have the cash go for a Google Search Appliance. For ~$3000 it will do everything...EVERYTHING.
19 August 2009 at 7:13pm
@Dalesaurus. Thanks for the info.
It's funny, when I was looking around at CMSs to evaluate, one of the killer features I thought SS had was the ability to search binary documents. I *swear* I read it in the documentation somewhere. Also, the guy in this thread seems to suggest it's possible: http://ssorg.bigbird.silverstripe.com/archive/show/127856#post127856
But I take your point, and I guess I'm resigned to having to think about integrating a third-party solution.
Does anyone out there have any experience of integrating Lucene with SS and can offer any pointers or tips?
(PS. I'm not really doing "document management", but the application I need to build is an intranet where they make lots of core information available in PDF form. Therefore, the ability to search those PDFs is a key requirement.)
20 August 2009 at 2:54am
A super hacky solution would be to add a form for PDF search and have it exec a grep command from PHP. There are CLI tools that can read PDFs, grep can return the matching filenames, and you can use DataObject::get* to grab the objects by name and pack them into a DataObjectSet for pretty results.
It's not ideal, but it's a pretty minimal amount of coding.
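To make the idea concrete, here is a rough sketch of the "extract text, then grep for the term" step. It is written in Python rather than PHP purely for illustration, and the function names (extract_text, find_matching_files) are made up for this example. It assumes the pdftotext command-line tool (from poppler-utils) is installed for reading PDFs; if it is not, PDFs are simply skipped. Word documents would need a different extractor, which is not covered here.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def extract_text(path: Path) -> str:
    """Return the searchable text of a file (hypothetical helper).

    PDFs are converted with the `pdftotext` CLI if it is available;
    every other file is read as plain text.
    """
    if path.suffix.lower() == ".pdf":
        if shutil.which("pdftotext") is None:
            return ""  # assumption: silently skip PDFs when the tool is missing
        # The trailing "-" tells pdftotext to write the text to stdout.
        result = subprocess.run(
            ["pdftotext", str(path), "-"],
            capture_output=True, text=True, check=False,
        )
        return result.stdout
    return path.read_text(encoding="utf-8", errors="ignore")


def find_matching_files(root: str, term: str) -> List[str]:
    """Return the sorted names of files under `root` whose text contains
    `term`, ignoring case (roughly what `grep -l -i` would report)."""
    needle = term.lower()
    matches = []
    for path in Path(root).rglob("*"):
        if path.is_file() and needle in extract_text(path).lower():
            matches.append(path.name)
    return sorted(matches)
```

The returned filenames are what you would then hand to DataObject::get* on the SS side to look up the matching file records. Note that this scans every file on every search, so it is only realistic for a small number of documents.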
However, if you are looking for an enterprise intranet that will have non-tech folks working with it, I would still recommend evaluating KnowledgeTree or Foswiki.
20 August 2009 at 7:35pm
@Dalesaurus. Thanks for your suggestions.
Foswiki looks interesting, and might actually be suitable for another project I'm meant to be looking at, but I REALLY want to use SS for this intranet project. So, I need to find a way of searching the binaries within SS and integrating them with the database search results.
I think I'll open a new thread with this requirement stated clearly. Someone out there must have had to nail this puppy before...