User Information
| Display Name: | Colin Bell |
|---|---|
| Partial Email Address: | colin...@gmail.com |
| Posts: |
|
| 1) Colin Bell Re: [Xapian-discuss] UTF-8 Corruption |
Thanks James

It's looking good:
http://lxr.mozilla.org/mozilla/source/intl/chardet/

On 20 Mar 2008, at 14:11, James Aylett wrote:
> On Thu, Mar 20, 2008 at 02:08:00PM +0000, Colin Bell wrote:
>
>>> There are ways to detect the character set of a file, though not
>>> always 100% reliably.
>>
>> Can anyone recommend some c++ code to do this?
>
> I assume, but don't know, that the Firefox/Mozilla ``magic'' charset
> detector is in C or C++ (the one that Mark Pilgrim ported to Python).
>
> J
>
> --
> /--------------------------------------------------------------------------\
> James Aylett
> xapian.org
> j...@tartarus.org
> uncertaintydivision.org
>
> _______________________________________________
> Xapian-discuss mailing list
> Xapian-di...@lists.xapian.org
> http://lists.xapian.org/mailman/listinfo/xapian-discuss
_______________________________________________ Xapian-discuss mailing list [email protected: Xapian-di...@lists.xapian.org] http://lists.xapian.org/mailman/listinfo/xapian-discuss
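A full charset detector such as the Mozilla one is a large piece of code, but a cheap first step is simply to test whether a file is well-formed UTF-8; text in ISO 8859-1 that contains accented characters will almost always fail that test. The following is a minimal sketch of such a check, not code from this thread. The function names `utf8_sequence_length` and `is_valid_utf8` are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the length (1-4) of a well-formed UTF-8
// sequence starting at s[i], or 0 if the bytes there are not valid UTF-8.
std::size_t utf8_sequence_length(const std::string& s, std::size_t i) {
    const unsigned char b0 = static_cast<unsigned char>(s[i]);
    if (b0 < 0x80) return 1;  // plain ASCII

    std::size_t len;
    unsigned cp;
    if (b0 >= 0xC2 && b0 <= 0xDF) { len = 2; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
    else if (b0 >= 0xF0 && b0 <= 0xF4) { len = 4; cp = b0 & 0x07; }
    else return 0;  // stray continuation byte or illegal lead byte

    if (i + len > s.size()) return 0;  // sequence truncated by end of input
    for (std::size_t k = 1; k < len; ++k) {
        const unsigned char b = static_cast<unsigned char>(s[i + k]);
        if ((b & 0xC0) != 0x80) return 0;  // not a continuation byte
        cp = (cp << 6) | (b & 0x3F);
    }
    if (len == 3 && cp < 0x800) return 0;                  // overlong form
    if (len == 3 && cp >= 0xD800 && cp <= 0xDFFF) return 0; // surrogate
    if (len == 4 && (cp < 0x10000 || cp > 0x10FFFF)) return 0;
    return len;
}

// Returns true if every byte of s belongs to a well-formed UTF-8 sequence.
bool is_valid_utf8(const std::string& s) {
    for (std::size_t i = 0; i < s.size();) {
        const std::size_t n = utf8_sequence_length(s, i);
        if (n == 0) return false;
        i += n;
    }
    return true;
}
```

A file that passes can be treated as UTF-8. A file that fails is in some other encoding, and this check alone cannot say which one; a statistical detector is still needed to tell ISO 8859-1 apart from other 8-bit charsets.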
| 2) Colin Bell Re: [Xapian-discuss] UTF-8 Corruption |
Thanks Olly

Very much appreciated as always.

> If you pass data through Xapian::Utf8Iterator before doing anything with
> it, then this will fix bad UTF-8. This is essentially what omindex
> does to deal with this problem.

I take it that Xapian::Utf8Iterator will only fix bad UTF-8 not convert to UTF-8?

> On Fri, Mar 14, 2008 at 11:14:56PM +0000, Colin Bell wrote:
>> I was wondering if anyone every came across a problem I seem to be
>> having. I'm indexing in text files using some basic code written in
>> C++. The text files may or may not be in UTF-8, ISO 8859-1 or possibly
>> (but very rarely) even some other format - I have no way of knowing.
>
> There are ways to detect the character set of a file, though not always
> 100% reliably.

Can anyone recommend some c++ code to do this?

Regards
Colin

On 18 Mar 2008, at 03:56, Olly Betts wrote:
> On Fri, Mar 14, 2008 at 11:14:56PM +0000, Colin Bell wrote:
>> I was wondering if anyone every came across a problem I seem to be
>> having. I'm indexing in text files using some basic code written in
>> C++. The text files may or may not be in UTF-8, ISO 8859-1 or possibly
>> (but very rarely) even some other format - I have no way of knowing.
>
> There are ways to detect the character set of a file, though not always
> 100% reliably.
>
>> Question is, does Xapian convert none UTF-8 characters when it stores
>> the document. I think I read that UTF-8 is the default encoding for
>> Xapian, which is exactly what I am after.
>
> Most of Xapian treats things as opaque data. The classes which need
> to know are Xapian::Stem, Xapian::QueryParser, and
> Xapian::TermGenerator. The UTF-8 parsing used by the latter two will
> treat invalid sequences as if they were ISO-8859-1, which for
> real-world examples will almost always do the right thing when fed
> ISO-8859-1. Xapian::Stem uses Snowball's UTF-8 parsing code currently -
> I'm not sure how that handles invalid sequences.
>
>> The reason I'm asking is that I am getting some seriously corrupted
>> characters in the index. When they are displayed on Tomcat I get a
>> "sun.io.MalformedInputException" when trying to display the search
>> results. I have set the pages charset to UTF-8 and apparently this
>> error is thrown when when the streamreader detects characters that are
>> not proper UTF-8 characters.
>
> If you set document data, document values, or directly add terms (using
> Document::add_posting() or Document::add_term()) then you'll get back
> what you put in verbatim. So if you pass in something which is invalid
> UTF-8, it will still be invalid.
>
> If you pass data through Xapian::Utf8Iterator before doing anything with
> it, then this will fix bad UTF-8. This is essentially what omindex
> does to deal with this problem.
>
> Cheers,
> Olly
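The repair Olly describes (keep valid UTF-8 as it is, and treat each invalid byte as if it were ISO-8859-1) can be illustrated without Xapian. The sketch below is a standalone approximation of that idea, not Xapian's implementation; `valid_sequence_length` and `repair_utf8` are hypothetical names. With Xapian itself, the usual pattern is presumably to iterate the input with `Xapian::Utf8Iterator` and re-encode each character with `Xapian::Unicode::append_utf8()`; check the API documentation for your version before relying on that.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: length (1-4) of a well-formed UTF-8 sequence at
// s[i], or 0 if the bytes there are not valid UTF-8.
std::size_t valid_sequence_length(const std::string& s, std::size_t i) {
    const unsigned char b0 = static_cast<unsigned char>(s[i]);
    if (b0 < 0x80) return 1;

    std::size_t len;
    unsigned cp;
    if (b0 >= 0xC2 && b0 <= 0xDF) { len = 2; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
    else if (b0 >= 0xF0 && b0 <= 0xF4) { len = 4; cp = b0 & 0x07; }
    else return 0;

    if (i + len > s.size()) return 0;
    for (std::size_t k = 1; k < len; ++k) {
        const unsigned char b = static_cast<unsigned char>(s[i + k]);
        if ((b & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (b & 0x3F);
    }
    if (len == 3 && (cp < 0x800 || (cp >= 0xD800 && cp <= 0xDFFF))) return 0;
    if (len == 4 && (cp < 0x10000 || cp > 0x10FFFF)) return 0;
    return len;
}

// Copies valid UTF-8 sequences unchanged; any other byte is interpreted as
// an ISO-8859-1 character and re-encoded as two-byte UTF-8.
std::string repair_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size();) {
        const std::size_t n = valid_sequence_length(in, i);
        if (n > 0) {
            out.append(in, i, n);
            i += n;
        } else {
            // Invalid bytes are always >= 0x80, so two bytes suffice.
            const unsigned char b = static_cast<unsigned char>(in[i]);
            out += static_cast<char>(0xC0 | (b >> 6));
            out += static_cast<char>(0x80 | (b & 0x3F));
            ++i;
        }
    }
    return out;
}
```

This answers the follow-up question only in part: the output is always valid UTF-8, and genuine ISO-8859-1 input usually comes out correctly, but input in any other charset is not converted properly, so it still needs a real conversion step before indexing.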
| 3) Colin Bell Re: [Xapian-discuss] Document snippet generation |
Hi Kevin

Sorry to hear you're having a problem. My compiler info is:

g++ (GCC) 4.1.2 20060928 (prerelease) (Ubuntu 4.1.1-13ubuntu5)

As you can see, it was developed and compiled on Ubuntu Linux. If you send me the errors I'll have a go at debugging it for you.

Regards
Colin

On 19 Mar 2008, at 20:54, Kevin Duraj wrote:
> Colin,
>
> Your code does not compile on Linux, I think it was written on Windows
> and I do not have much time to fix it. Even so, here is another great
> algorithm Gunning fog index.
> http://en.wikipedia.org/wiki/Gunning_fog_index
>
> Gunning fog index is designed to measure the readability of English
> writing. The resulting number is an indication of the number of years
> of formal education that a person requires in order to easily
> understand the text on the first reading. With Gunning fog index we
> could potentially measure the intelligence of a web page, assign boost
> value to it and get some great page ranking like Google does. :-)
>
> Kevin Duraj
> http://myhealthcare.com
>
>
> On Wed, Mar 19, 2008 at 12:37 PM, Colin Bell <colinabell@gmail.com> wrote:
>>
>> Hi Kevin
>>
>> I did attach the source code to the original posting but it seems to not
>> made it through the mailing list. You can download it here. I am using on
>> our company search and its doing a good job and is pretty fast. Needs a bit
>> of tidying up and my C++ knowledge is very weak, could do with some help.
>>
>> I will do some reading on the link you sent, thanks.
>>
>> http://www.cbell.info/XapSum.zip
>> Regards
>> Colin
>>
>>
>> On 19 Mar 2008, at 18:29, Kevin Duraj wrote:
>> On Tue, Mar 18, 2008 at 7:15 AM, Colin Bell <colinabell@gmail.com> wrote:
>> Hi All
>>
>> Following on from a discussion that was flying around a while back
>> about document snippets (summaries). I have knocked together some
>> proof of concept code (C++) that uses the Xapian stemming ability and
>> sentence extraction (see http://en.wikipedia.org/wiki/Sentence_extraction).
>> I also used the Open Text Summarizer project as an inspiration.
>>
>> It works quite well, but has some caveats which are explained in the
>> code comments. It can summarise, highlight sentences and highlight
>> words. It also has the ability to do context summaries. For example:
>> If you supply it with terms it will summarise the text within the
>> context of those terms.
>>
>> I am new to C++ programming so while your laughing out loud at the
>> poor coding, please keep that in mind. The code was assembled on an
>> Ubuntu Linux and comes with a Makefile. I have also supplied my
>> stopper class. For some reason the stopper still fails to stop some of
>> the words in the stopper (like "the") if anyone knows why, please let
>> me know.
>>
>> Feedback / comments / changes / improvements are more than welcome -
>> bring it on. I really hope this sparks an interest.
>>
>> Regards
>>
>> Colin
>>
>>
>> Colin!
>>
>> Great job, it definitely sparks an interest. Can you share the code with us,
>> or send the link where we can download it . I will run it against
>> myhealthcare.com 73 million document search engine using the sentence
>> summarizer, and we will see what kind of results we will get on the top.
>> Hopefully, we will get rid of web sites using excessive keywords stuffing
>> and spamdexing techniques.
>>
>> Did you have a chance to take a look at Flesh-Kincaid readability algorithm
>> design to measure comprehension difficulty in English language?
>> http://en.wikipedia.org/wiki/Flesch-Kincaid_Readability_Test
>>
>> Kevin Duraj
>> http://myhealthcare.com
| 4) Colin Bell Re: [Xapian-discuss] Document snippet generation |
Hi Kevin

I did attach the source code to the original posting but it seems not to have made it through the mailing list. You can download it here:

http://www.cbell.info/XapSum.zip

I am using it on our company search and it's doing a good job and is pretty fast. It needs a bit of tidying up and my C++ knowledge is very weak, so I could do with some help.

I will do some reading on the link you sent, thanks.

Regards
Colin

On 19 Mar 2008, at 18:29, Kevin Duraj wrote:
> On Tue, Mar 18, 2008 at 7:15 AM, Colin Bell <colinabell@gmail.com> wrote:
>> Hi All
>>
>> Following on from a discussion that was flying around a while back
>> about document snippets (summaries). I have knocked together some
>> proof of concept code (C++) that uses the Xapian stemming ability and
>> sentence extraction (see http://en.wikipedia.org/wiki/Sentence_extraction).
>> I also used the Open Text Summarizer project as an inspiration.
>>
>> It works quite well, but has some caveats which are explained in the
>> code comments. It can summarise, highlight sentences and highlight
>> words. It also has the ability to do context summaries. For example:
>> If you supply it with terms it will summarise the text within the
>> context of those terms.
>>
>> I am new to C++ programming so while your laughing out loud at the
>> poor coding, please keep that in mind. The code was assembled on an
>> Ubuntu Linux and comes with a Makefile. I have also supplied my
>> stopper class. For some reason the stopper still fails to stop some of
>> the words in the stopper (like "the") if anyone knows why, please let
>> me know.
>>
>> Feedback / comments / changes / improvements are more than welcome -
>> bring it on. I really hope this sparks an interest.
>>
>> Regards
>>
>> Colin
>>
>
> Colin!
>
> Great job, it definitely sparks an interest. Can you share the code
> with us, or send the link where we can download it . I will run it
> against myhealthcare.com 73 million document search engine using the
> sentence summarizer, and we will see what kind of results we will
> get on the top. Hopefully, we will get rid of web sites using
> excessive keywords stuffing and spamdexing techniques.
>
> Did you have a chance to take a look at Flesh-Kincaid readability
> algorithm design to measure comprehension difficulty in English
> language?
> http://en.wikipedia.org/wiki/Flesch-Kincaid_Readability_Test
>
> Kevin Duraj
> http://myhealthcare.com
| 5) Colin Bell Re: [Xapian-discuss] Document snippet generation |
Sorry, I was in a rush as usual. Try http://cbell.info/XapSum.zip
On 18/03/2008, Richard Boulton <richard@lemurconsulting.com> wrote:
> Colin Bell wrote:
> > As per my previous message about summarisation
> >
> > You can download the code here
> >
> > http://www.cbell.info/files/XapSum.zip
>
> That URL gives, for me:
>
> Error 403
>
> I'm sorry, but directory access is denied. Try searching for
> information from our search area above or by navigating using the
> navigation to the left. If your having trouble finding information,
> please use the "Contact Us" section to give us site feedback.
>
> --
> Richard