2 August 2010 at 5:29pm
Thanks Mark, this works. The outputting of additional indexing information also helped me debug another issue.
Now just to get Aram sorted
3 August 2010 at 4:53am
My indexes do seem to be working. Many of the files are empty, but some of them, such as Product.new.spd and PageLive.words, do now contain data.
Mark, I added the code and here is what it output: http://pastie.org/1071225
Once again, thank you both for the continued help!
3 August 2010 at 6:49am
@mark After some testing, all is working well. The full file is, however, not being stored in the file content cache. Not sure if you set this limit or if it's the datatype (mediumtext) that's limited.
Now just need to highlight searched term in results and show partial string around first occurrence of keyword.
@aram I take it you are using sqlite. The warning shouldn't affect indexing but try and load the module it requests. You should be able to do so from within httpd.conf for your apache.
The PID not being accessible means something is still not 100% right permission wise. Is this a VPS with a well known host or your own VPS? If you cannot come right I can always make sure your VPS is configured correctly for you.
3 August 2010 at 9:31am
@aram I think the sqlite warnings are a red herring. The permissions failure is probably the issue. You'll notice it also says it hasn't rotated the indexes. When sphinx reindexes, it creates the new indexes in *.new.* files, and only switches searchd over to them once they are fully built, so indexing doesn't result in search engine downtime. Usually these files are very transient. But because it's having problems talking to searchd, it is not able to rotate, so searchd will still be using the old indexes.
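A quick way to confirm that rotation is stuck is to look for index files that still carry the `.new.` infix well after indexing has finished. This is only a minimal sketch: the function name and the directory you pass in are assumptions, and it relies on nothing more than the `*.new.*` naming described above.

```python
from pathlib import Path


def find_unrotated_index_files(index_dir):
    """Return the names of files in index_dir that still contain '.new.'.

    A non-empty result long after an indexing run suggests that searchd
    never picked up the freshly built indexes.
    """
    return sorted(
        entry.name
        for entry in Path(index_dir).iterdir()
        if entry.is_file() and ".new." in entry.name
    )
```

If this keeps returning names such as Product.new.spd between indexing runs, the rotation step is the part that is failing.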
To understand the permissions, you need to start with apache. Apache is usually started as root so it can bind to privileged ports (e.g. port 80). But it can be configured to create sub-processes running as a different user to process requests (under debian this is usually the www-data user). It's the latter user you need to know about, because in this scenario it creates the important players: the files under temp, including the sphinx conf, and the searchd process. If there is a mismatch there will probably be problems, so double check that everything is owned by the same user. To be sure, I would also ensure there is only one searchd running.
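The file side of that ownership check can be sketched as below. The helper names are made up for illustration, the paths you feed in depend on your own setup, and it only covers files: the owner of the running searchd process still has to be checked separately. It assumes a Unix host, since it uses the `pwd` module.

```python
import os
import pwd


def owners_of(paths):
    """Map each path to the name of the user that owns it."""
    owners = {}
    for path in paths:
        uid = os.stat(path).st_uid
        try:
            owners[path] = pwd.getpwuid(uid).pw_name
        except KeyError:
            # No passwd entry for this uid; fall back to the number.
            owners[path] = str(uid)
    return owners


def has_owner_mismatch(paths):
    """Return True if the given paths are not all owned by one user."""
    return len(set(owners_of(paths).values())) > 1
```

Pass it the sphinx conf, the index files and the PID file; if `has_owner_mismatch` returns True, `owners_of` shows which file belongs to the odd user out.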
@enclave: I'd check that the documents are coming back in the search results. It is odd that the file results are not being cached. It implies that either the file contents are not being indexed at all, or it is having an issue storing it in the file; I'm suspicious either way (that may just be my nature :-p)
3 August 2010 at 9:37am
Hi Mark, sorry you may have misunderstood me. I mean the FileContentCache is being populated and files returned in the results. This is awesome, but the files are very large case documents and only the first few pages are extracted and stored. I can see the data stored abruptly ends at a certain point without indexing the full file
3 August 2010 at 9:50am
Ok, that's a better problem
The file contents cache is a mediumtext field, so it should hold up to 16 MB of text. If this is coming from a PDF, it should be smaller than the PDF, since pdf2text will only return the stream of words, without formatting or images etc. How long is the text that is stored in the field at the moment? Is it truncated because the field is too small, or could there be something else truncating it?
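One way to answer the "how long is it" question is to compare the byte length of the stored value with the size limits of the MySQL text types. This is a rough sketch that assumes a MySQL database and UTF-8 storage; the function name and the small tolerance (to allow for a multi-byte character cut off at the boundary) are my own choices.

```python
# Maximum sizes in bytes of the MySQL text column types.
MYSQL_TEXT_LIMITS = {
    "TINYTEXT": 255,
    "TEXT": 65_535,
    "MEDIUMTEXT": 16_777_215,
}


def matching_column_limit(stored_text, encoding="utf-8", tolerance=3):
    """Return the column type whose limit the stored text sits at, or None.

    A match suggests the field size did the truncating; None suggests
    something earlier in the pipeline cut the text short.
    """
    size = len(stored_text.encode(encoding))
    for name, limit in MYSQL_TEXT_LIMITS.items():
        if limit - tolerance <= size <= limit:
            return name
    return None
```

If the stored text lands right at 65,535 bytes, for example, the column is probably a plain text field rather than mediumtext. If it matches none of the limits, the extraction step is the more likely culprit.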
4 August 2010 at 1:30am Last edited: 5 August 2010 8:34pm
As far as I can see, everything is owned by root, but Enclave, if you could take a look that would be awesome.
7 August 2010 at 1:56am
Just to update: I have given up on this. It seems there is something going on with my server/Apache setup which prevents Sphinx from working. Major bummer, but I don't have enough knowledge to fix it.