Colin Bell (colin...@gmail.com)

User Information

Display Name:Colin Bell
Partial Email Address:colin...@gmail.com
Posts:
1 total
1 in Xapian

5 Most Recent

1) Colin Bell Re: [Xapian-discuss] UTF-8 Corruption
Thanks James

It's looking good

http://lxr.mozilla.org/mozilla/source/intl/chardet/



On 20 Mar 2008, at 14:11, James Aylett wrote:

> On Thu, Mar 20, 2008 at 02:08:00PM +0000, Colin Bell wrote:
>
>>> There are ways to detect the character set of a file, though not
>>> always 100% reliably.
>>
>> Can anyone recommend some C++ code to do this?
>
> I assume, but don't know, that the Firefox/Mozilla ``magic'' charset
> detector is in C or C++ (the one that Mark Pilgrim ported to Python).
>
> J
>
> --  
> /--------------------------------------------------------------------------\
> James Aylett
> xapian.org
> j...@tartarus.org
> uncertaintydivision.org
>
> _______________________________________________
> Xapian-discuss mailing list
> Xapian-di...@lists.xapian.org
> http://lists.xapian.org/mailman/listinfo/xapian-discuss
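
As a rough illustration of the detection idea discussed above, here is a minimal sketch in standard C++. It is not the Mozilla detector and makes no attempt at statistical detection: it only checks whether the bytes are well-formed UTF-8 and otherwise assumes ISO-8859-1, which matches the two encodings mentioned in the thread. The names `is_valid_utf8`, `guess_charset` and `Charset` are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true if `s` is well-formed UTF-8. Overlong encodings, UTF-16
// surrogates and code points above U+10FFFF are rejected.
bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (c < 0x80) { ++i; continue; }          // plain ASCII byte
        std::size_t extra = 0;                     // continuation bytes expected
        unsigned long cp = 0, min_cp = 0;          // decoded value and its lower bound
        if (c >= 0xC2 && c <= 0xDF)      { extra = 1; cp = c & 0x1F; min_cp = 0x80; }
        else if (c >= 0xE0 && c <= 0xEF) { extra = 2; cp = c & 0x0F; min_cp = 0x800; }
        else if (c >= 0xF0 && c <= 0xF4) { extra = 3; cp = c & 0x07; min_cp = 0x10000; }
        else return false;                         // stray continuation or bad lead byte
        if (n - i <= extra) return false;          // sequence runs past the end
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char t = static_cast<unsigned char>(s[i + k]);
            if ((t & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (t & 0x3F);
        }
        if (cp < min_cp) return false;                   // overlong form
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode
        i += extra + 1;
    }
    assert(i == n);  // every accepted sequence was consumed exactly
    return true;
}

// Hypothetical result type for this sketch only.
enum class Charset { Ascii, Utf8, AssumedLatin1 };

// Heuristic: pure ASCII, else valid UTF-8, else assume ISO-8859-1.
// The last answer is a guess; any byte string is "valid" ISO-8859-1.
Charset guess_charset(const std::string& s) {
    bool has_high_byte = false;
    for (unsigned char c : s) {
        if (c >= 0x80) { has_high_byte = true; break; }
    }
    if (!has_high_byte) return Charset::Ascii;
    return is_valid_utf8(s) ? Charset::Utf8 : Charset::AssumedLatin1;
}
```

Calling `guess_charset` on the whole file contents before indexing gives a cheap first answer. Files in other legacy encodings will be reported as `AssumedLatin1`, so a real detector is still needed if those matter.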


2) Colin Bell Re: [Xapian-discuss] UTF-8 Corruption
Thanks Olly

Very much appreciated as always.

>
> If you pass data through Xapian::Utf8Iterator before doing anything
> with
> it, then this will fix bad UTF-8. This is essentially what omindex
> does to deal with this problem.

I take it that Xapian::Utf8Iterator will only fix bad UTF-8, not
convert other encodings to UTF-8?

> On Fri, Mar 14, 2008 at 11:14:56PM +0000, Colin Bell wrote:
>> I was wondering if anyone ever came across a problem I seem to be
>> having. I'm indexing text files using some basic code written in
>> C++. The text files may or may not be in UTF-8, ISO 8859-1 or possibly
>> (but very rarely) even some other format - I have no way of knowing.
>
> There are ways to detect the character set of a file, though not
> always
> 100% reliably.


Can anyone recommend some C++ code to do this?

Regards

Colin


On 18 Mar 2008, at 03:56, Olly Betts wrote:

> On Fri, Mar 14, 2008 at 11:14:56PM +0000, Colin Bell wrote:
>> I was wondering if anyone ever came across a problem I seem to be
>> having. I'm indexing text files using some basic code written in
>> C++. The text files may or may not be in UTF-8, ISO 8859-1 or possibly
>> (but very rarely) even some other format - I have no way of knowing.
>
> There are ways to detect the character set of a file, though not
> always
> 100% reliably.
>
>> Question is, does Xapian convert non-UTF-8 characters when it stores
>> the document? I think I read that UTF-8 is the default encoding for
>> Xapian, which is exactly what I am after.
>
> Most of Xapian treats things as opaque data. The classes which need
> to know are Xapian::Stem, Xapian::QueryParser, and
> Xapian::TermGenerator. The UTF-8 parsing used by the latter two will
> treat invalid sequences as if they were ISO-8859-1, which for
> real-world examples will almost always do the right thing when fed
> ISO-8859-1. Xapian::Stem uses Snowball's UTF-8 parsing code
> currently -
> I'm not sure how that handles invalid sequences.
>
>> The reason I'm asking is that I am getting some seriously corrupted
>> characters in the index. When they are displayed on Tomcat I get a
>> "sun.io.MalformedInputException" when trying to display the search
>> results. I have set the page's charset to UTF-8 and apparently this
>> error is thrown when the stream reader detects characters that
>> are not proper UTF-8 characters.
>
> If you set document data, document values, or directly add terms
> (using
> Document::add_posting() or Document::add_term()) then you'll get back
> what you put in verbatim. So if you pass in something which is
> invalid
> UTF-8, it will still be invalid.
>
> If you pass data through Xapian::Utf8Iterator before doing anything
> with
> it, then this will fix bad UTF-8. This is essentially what omindex
> does to deal with this problem.
>
> Cheers,
>    Olly
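
To make the fallback Olly describes concrete, here is a minimal sketch in standard C++ of "treat invalid sequences as ISO-8859-1". It is not Xapian's code and does not use Xapian::Utf8Iterator; the function names `utf8_sequence_length` and `repair_utf8` are invented for this example. Well-formed UTF-8 sequences are copied through unchanged, and any byte that does not start one is re-encoded as the ISO-8859-1 character with the same value.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Length (1 to 4) of a well-formed UTF-8 sequence starting at s[i],
// or 0 if the bytes at that position are not valid UTF-8.
std::size_t utf8_sequence_length(const std::string& s, std::size_t i) {
    const unsigned char c = static_cast<unsigned char>(s[i]);
    if (c < 0x80) return 1;
    std::size_t len = 0;
    unsigned long cp = 0, min_cp = 0;
    if (c >= 0xC2 && c <= 0xDF)      { len = 2; cp = c & 0x1F; min_cp = 0x80; }
    else if (c >= 0xE0 && c <= 0xEF) { len = 3; cp = c & 0x0F; min_cp = 0x800; }
    else if (c >= 0xF0 && c <= 0xF4) { len = 4; cp = c & 0x07; min_cp = 0x10000; }
    else return 0;
    if (s.size() - i < len) return 0;              // truncated at end of input
    for (std::size_t k = 1; k < len; ++k) {
        const unsigned char t = static_cast<unsigned char>(s[i + k]);
        if ((t & 0xC0) != 0x80) return 0;          // missing continuation byte
        cp = (cp << 6) | (t & 0x3F);
    }
    if (cp < min_cp || cp > 0x10FFFF) return 0;    // overlong or out of range
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;    // surrogate
    return len;
}

// Returns a valid UTF-8 string: good sequences are kept, every other
// byte is interpreted as ISO-8859-1 and re-encoded as two UTF-8 bytes.
std::string repair_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    std::size_t i = 0;
    while (i < in.size()) {
        const std::size_t len = utf8_sequence_length(in, i);
        if (len > 0) {
            out.append(in, i, len);
            i += len;
        } else {
            const unsigned char c = static_cast<unsigned char>(in[i]);
            assert(c >= 0x80);  // ASCII bytes always form a valid sequence
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
            ++i;
        }
    }
    return out;
}
```

Running document data, values and hand-built terms through something like `repair_utf8` before storing them means whatever comes back out of the index is at least valid UTF-8. As the reply notes, this repairs bad UTF-8 on the assumption that stray bytes are ISO-8859-1; it does not detect or convert other encodings.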


3) Colin Bell Re: [Xapian-discuss] Document snippet generation
Hi Kevin

Sorry to hear you're having a problem. My compiler info is

g++ (GCC) 4.1.2 20060928 (prerelease) (Ubuntu 4.1.1-13ubuntu5)

As you can see it was developed and compiled on Ubuntu Linux. If you  
send me the errors I'll have a go at debugging it for you.

Regards

Colin


On 19 Mar 2008, at 20:54, Kevin Duraj wrote:

> Colin,
>
> Your code does not compile on Linux, I think it was written on Windows
> and I do not have much time to fix it. Even so, here is another great
> algorithm: the Gunning fog index.
> http://en.wikipedia.org/wiki/Gunning_fog_index
>
> Gunning fog index is designed to measure the readability of English
> writing. The resulting number is an indication of the number of years
> of formal education that a person requires in order to easily
> understand the text on the first reading. With Gunning fog index we
> could potentially measure the intelligence of a web page, assign boost
> value to it and get some great page ranking like Google does. :-)
>
> Kevin Duraj
> http://myhealthcare.com
>
>
> On Wed, Mar 19, 2008 at 12:37 PM, Colin Bell <colinabell@gmail.com>
> wrote:
>>
>> Hi Kevin
>>
>> I did attach the source code to the original posting but it seems
>> not to have made it through the mailing list. You can download it
>> here. I am using it on our company search and it's doing a good job
>> and is pretty fast. It needs a bit of tidying up and my C++ knowledge
>> is very weak, so I could do with some help.
>>
>> I will do some reading on the link you sent, thanks.
>>
>> http://www.cbell.info/XapSum.zip
>> Regards
>> Colin
>>
>>
>> On 19 Mar 2008, at 18:29, Kevin Duraj wrote:
>> On Tue, Mar 18, 2008 at 7:15 AM, Colin Bell <colinabell@gmail.com>
>> wrote:
>> Hi All
>>
>> Following on from a discussion that was flying around a while back
>> about document snippets (summaries), I have knocked together some
>> proof of concept code (C++) that uses the Xapian stemming ability and
>> sentence extraction (see http://en.wikipedia.org/wiki/Sentence_extraction).
>> I also used the Open Text Summarizer project as an inspiration.
>>
>> It works quite well, but has some caveats which are explained in the
>> code comments. It can summarise, highlight sentences and highlight
>> words. It also has the ability to do context summaries. For example:
>> If you supply it with terms it will summarise the text within the
>> context of those terms.
>>
>> I am new to C++ programming so while you're laughing out loud at the
>> poor coding, please keep that in mind. The code was assembled on
>> Ubuntu Linux and comes with a Makefile. I have also supplied my
>> stopper class. For some reason the stopper still fails to stop some
>> of the words in its list (like "the"). If anyone knows why, please
>> let me know.
>>
>> Feedback / comments / changes / improvements are more than welcome -
>> bring it on. I really hope this sparks an interest.
>>
>> Regards
>>
>> Colin
>>
>>
>> Colin!
>>
>> Great job, it definitely sparks an interest. Can you share the code
>> with us,
>> or send the link where we can download it . I will run it against
>> myhealthcare.com 73 million document search engine using the sentence
>> summarizer, and we will see what kind of results we will get on the
>> top.
>> Hopefully, we will get rid of web sites using excessive keyword
>> stuffing and spamdexing techniques.
>>
>> Did you have a chance to take a look at the Flesch-Kincaid readability
>> algorithm, designed to measure comprehension difficulty in the English
>> language?
>> http://en.wikipedia.org/wiki/Flesch-Kincaid_Readability_Test
>>
>> Kevin Duraj
>> http://myhealthcare.com
>>
>>
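
For reference, the two readability measures Kevin mentions reduce to short formulas once the text has been counted. The sketch below, in standard C++, assumes the word, sentence, syllable and complex-word counts are already available; producing them (syllable counting in particular) needs a heuristic or dictionary that is not shown. The `TextCounts` struct and function names are invented for this example, and "complex word" is taken in the usual Gunning fog sense of a word with three or more syllables.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical container for counts gathered elsewhere.
struct TextCounts {
    std::size_t words;
    std::size_t sentences;
    std::size_t syllables;
    std::size_t complex_words;  // words with three or more syllables
};

// Gunning fog index:
//   0.4 * (words / sentences + 100 * complex_words / words)
// Returns 0 for empty input rather than dividing by zero.
double gunning_fog(const TextCounts& c) {
    assert(c.complex_words <= c.words);
    if (c.words == 0 || c.sentences == 0) return 0.0;
    const double avg_sentence_length = static_cast<double>(c.words) / c.sentences;
    const double percent_complex = 100.0 * c.complex_words / c.words;
    return 0.4 * (avg_sentence_length + percent_complex);
}

// Flesch-Kincaid grade level:
//   0.39 * words / sentences + 11.8 * syllables / words - 15.59
double flesch_kincaid_grade(const TextCounts& c) {
    if (c.words == 0 || c.sentences == 0) return 0.0;
    const double avg_sentence_length = static_cast<double>(c.words) / c.sentences;
    const double syllables_per_word = static_cast<double>(c.syllables) / c.words;
    return 0.39 * avg_sentence_length + 11.8 * syllables_per_word - 15.59;
}
```

Both results approximate years of schooling needed to read the text. Whether such a score is a useful ranking boost is Kevin's suggestion, not something this sketch tests.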


4) Colin Bell Re: [Xapian-discuss] Document snippet generation
Hi Kevin

I did attach the source code to the original posting but it seems
not to have made it through the mailing list. You can download it
here. I am using it on our company search and it's doing a good job
and is pretty fast. It needs a bit of tidying up and my C++ knowledge
is very weak, so I could do with some help.

I will do some reading on the link you sent, thanks.

http://www.cbell.info/XapSum.zip

Regards

Colin


On 19 Mar 2008, at 18:29, Kevin Duraj wrote:

> On Tue, Mar 18, 2008 at 7:15 AM, Colin Bell <colinabell@gmail.com>
> wrote:
>> Hi All
>>
>> Following on from a discussion that was flying around a while back
>> about document snippets (summaries), I have knocked together some
>> proof of concept code (C++) that uses the Xapian stemming ability and
>> sentence extraction (see http://en.wikipedia.org/wiki/Sentence_extraction).
>> I also used the Open Text Summarizer project as an inspiration.
>>
>> It works quite well, but has some caveats which are explained in the
>> code comments. It can summarise, highlight sentences and highlight
>> words. It also has the ability to do context summaries. For example:
>> If you supply it with terms it will summarise the text within the
>> context of those terms.
>>
>> I am new to C++ programming so while you're laughing out loud at the
>> poor coding, please keep that in mind. The code was assembled on
>> Ubuntu Linux and comes with a Makefile. I have also supplied my
>> stopper class. For some reason the stopper still fails to stop some
>> of the words in its list (like "the"). If anyone knows why, please
>> let me know.
>>
>> Feedback / comments / changes / improvements are more than welcome -
>> bring it on. I really hope this sparks an interest.
>>
>> Regards
>>
>> Colin
>>
>
> Colin!
>
> Great job, it definitely sparks an interest. Can you share the code
> with us, or send the link where we can download it . I will run it
> against myhealthcare.com 73 million document search engine using the
> sentence summarizer, and we will see what kind of results we will
> get at the top. Hopefully, we will get rid of web sites using
> excessive keyword stuffing and spamdexing techniques.
>
> Did you have a chance to take a look at the Flesch-Kincaid readability
> algorithm, designed to measure comprehension difficulty in the English
> language?
> http://en.wikipedia.org/wiki/Flesch-Kincaid_Readability_Test
>
> Kevin Duraj
> http://myhealthcare.com
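
For readers without access to the XapSum archive, the general idea of sentence extraction with optional context terms can be sketched as follows. This is not Colin's code: it is a minimal standard C++ illustration that skips stemming (the original uses Xapian's stemmer), uses a tiny made-up stopword list, splits sentences naively on '.', '!' and '?', and applies an arbitrary boost of 10 for context terms. All names here are invented for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Naive sentence splitter; abbreviations such as "e.g." will confuse it.
std::vector<std::string> split_sentences(const std::string& text) {
    std::vector<std::string> out;
    std::string cur;
    for (std::size_t i = 0; i <= text.size(); ++i) {
        const bool at_end = (i == text.size());
        if (!at_end) cur += text[i];
        if (at_end || text[i] == '.' || text[i] == '!' || text[i] == '?') {
            const std::size_t b = cur.find_first_not_of(" \t\r\n");
            if (b != std::string::npos) out.push_back(cur.substr(b));
            cur.clear();
        }
    }
    return out;
}

// Lower-cased runs of ASCII letters.
std::vector<std::string> tokenize(const std::string& s) {
    std::vector<std::string> words;
    std::string w;
    for (unsigned char ch : s) {
        if (std::isalpha(ch)) w += static_cast<char>(std::tolower(ch));
        else if (!w.empty()) { words.push_back(w); w.clear(); }
    }
    if (!w.empty()) words.push_back(w);
    return words;
}

// Scores each sentence by the document-wide frequency of its non-stop
// words, boosts context terms, and returns the best sentences in their
// original order.
std::vector<std::string> summarise(
        const std::string& text, std::size_t max_sentences,
        const std::set<std::string>& context_terms = std::set<std::string>()) {
    static const std::set<std::string> stopwords =
        {"the", "a", "an", "and", "of", "to", "in", "is", "it"};
    const std::vector<std::string> sentences = split_sentences(text);
    std::map<std::string, int> freq;
    for (const std::string& s : sentences)
        for (const std::string& w : tokenize(s))
            if (!stopwords.count(w)) ++freq[w];

    std::vector<std::pair<double, std::size_t> > scored;
    for (std::size_t i = 0; i < sentences.size(); ++i) {
        double score = 0;
        for (const std::string& w : tokenize(sentences[i])) {
            if (stopwords.count(w)) continue;
            score += freq[w];
            if (context_terms.count(w)) score += 10;  // arbitrary boost
        }
        scored.push_back(std::make_pair(score, i));
    }
    // Highest score first; stable sort keeps earlier sentences on ties.
    std::stable_sort(scored.begin(), scored.end(),
        [](const std::pair<double, std::size_t>& a,
           const std::pair<double, std::size_t>& b) { return a.first > b.first; });
    if (scored.size() > max_sentences) scored.resize(max_sentences);

    std::vector<std::size_t> keep;
    for (std::size_t i = 0; i < scored.size(); ++i) keep.push_back(scored[i].second);
    std::sort(keep.begin(), keep.end());
    assert(keep.size() <= max_sentences);

    std::vector<std::string> out;
    for (std::size_t i = 0; i < keep.size(); ++i) out.push_back(sentences[keep[i]]);
    return out;
}
```

Summing raw frequencies favours long sentences, so a real summariser would probably normalise by length; lower-casing before the stopword lookup also matters, since a capitalised "The" would not match a lower-case stopword entry.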

5) Colin Bell Re: [Xapian-discuss] Document snippet generation
Sorry, I was in a rush as usual. Try http://cbell.info/XapSum.zip


On 18/03/2008, Richard Boulton <richard@lemurconsulting.com> wrote:
> Colin Bell wrote:
> > As per my previous message about summarisation
> >
> > You can download the code here
> >
> > http://www.cbell.info/files/XapSum.zip
>
> That URL gives, for me:
>
> Error 403
>
> I'm sorry, but directory access is denied. Try searching for
> information from our search area above or by navigating using the
> navigation to the left. If your having trouble finding information,
> please use the "Contact Us" section to give us site feedback.
>
> --
> Richard
>

