10 January 2014 at 4:10pm
I would appreciate hearing your recommendations/experiences with site searches that include PDF files.
I have a small site that includes PDFs, and I would like these to be included in the search. I am thinking of adding the Google Site Search module, but the link in the documentation goes to Google Enterprise Search. Can the custom search in Google Webmaster Tools be used instead? It would be hard to justify the cost of Google Enterprise Search. However, the Google forums contain many posts about the difficulties of searching PDF files!
10 January 2014 at 10:26pm
If you want an easy solution, the Google site search module (https://github.com/dnadesign/silverstripe-googlesitesearch) will include file contents, and it was free for up to 1,000 searches a month when I released it. An example of it running is at http://tnzi.com/
The other option would be Solr + Text Extraction (http://addons.silverstripe.org/add-ons/silverstripe/textextraction), which would require non-trivial setup costs, so I'm a fan of Google search for quick use.
16 January 2014 at 2:46pm
Willr, thank you very much for your reply.
As Google are now charging $100 a year for the enterprise version of their search engine, I have been attempting to set up the free version to find files (PDF, DOCX, PPTX, etc.). (Unfortunately, it does not appear to supply a cse_key that would let me use your googlesitesearch module.)
Google finds no errors in the sitemap and now appears to have indexed the site's pages. It still doesn't find any linked files, though! The links are in a members-only section, but I have seen a forum post asking how to prevent such files from appearing in the search results, so that should not be the problem. Google also reports no blocked URLs.
Any ideas about what newbie mistake I must be making?
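One way to test the members-only assumption is to save the HTML that a logged-out visitor receives for one of those pages and check whether the document links are actually present in it, since a crawler is not logged in. The following is a minimal sketch, assuming the page source has already been saved as a string; the `DOC_EXTENSIONS` tuple, the `document_links` helper, and the sample markup are hypothetical names and data chosen for illustration, not part of any module mentioned in this thread.

```python
from html.parser import HTMLParser

# Hypothetical list of file types to look for; adjust to the site's documents.
DOC_EXTENSIONS = (".pdf", ".docx", ".pptx")


class DocLinkParser(HTMLParser):
    """Collect the href of every <a> tag that points at a document file."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Ignore any query string or fragment when checking the extension.
        path = href.split("?", 1)[0].split("#", 1)[0]
        if path.lower().endswith(DOC_EXTENSIONS):
            self.links.append(href)


def document_links(html):
    """Return the document links found in an HTML string, in page order."""
    parser = DocLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up markup standing in for the HTML served to a logged-out visitor.
SAMPLE_HTML = (
    '<p><a href="/assets/minutes.pdf">Minutes</a> '
    '<a href="/assets/Slides.PPTX?v=2">Slides</a> '
    '<a href="/contact/">Contact</a></p>'
)

if __name__ == "__main__":
    for link in document_links(SAMPLE_HTML):
        print(link)
```

If the list comes back empty for the logged-out HTML but not for the logged-in HTML, the links are probably hidden behind the login, and a crawler would have no way to reach the files.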
17 January 2014 at 5:16pm
I believe I have isolated the problem: Google search is not indexing the pages that contain the links to the documents. It is not finding the non-link text on those pages either. In the sitemap, the three pages have a 50% priority. I have increased their importance in the CMS, but this has yet to show in the sitemap. I hope that giving Google time to reindex the site will solve the problem.
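To confirm whether the new priorities have reached the sitemap, the file can be read with a few lines of Python. This is a minimal sketch, assuming a standard sitemaps.org `<urlset>` document; the `sitemap_priorities` helper and the sample URLs are made up for illustration, and an entry with no `<priority>` element is reported as 0.5, the default defined by the sitemap protocol.

```python
import xml.etree.ElementTree as ET

# Namespace used by sitemaps that follow the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_priorities(xml_text):
    """Return {url: priority} for every <url> entry in a sitemap document."""
    root = ET.fromstring(xml_text)
    priorities = {}
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        # The protocol's default priority is 0.5 when the element is absent.
        raw = url.findtext("sm:priority", default="0.5", namespaces=SITEMAP_NS)
        priorities[loc] = float(raw)
    return priorities


# Made-up sitemap content; replace with the text of the site's own sitemap.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc><priority>1.0</priority></url>
  <url><loc>http://example.com/members/documents/</loc><priority>0.5</priority></url>
  <url><loc>http://example.com/about/</loc></url>
</urlset>"""

if __name__ == "__main__":
    for loc, priority in sitemap_priorities(SAMPLE).items():
        print(f"{priority:.1f}  {loc}")
```

Running it again after the CMS change has been published shows whether the three pages still report 0.5.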