[ https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12976862#action_12976862 ]

Mark Miller commented on SOLR-2129:
-----------------------------------

bq. I have no problem committing this to contrib so future iterations can be from svn. any objections?

+1 - getting into trunk will likely expand usage and feedback, and get things rolling much faster. Bar is much lower for Solr contrib as well.


I've only started looking at the patch, but a few notes I jotted down:

StringBuffer usage in UpdateRequestProcessor - should be StringBuilder right?

The below is a little odd, no (critical code I know ;) )?

/* execute the AE on the given JCas */
private void executeAE(AnalysisEngine ae, JCas jcas) throws AnalysisEngineProcessException {
  ae.getLogger().log(Level.INFO, new StringBuffer("Analazying text").toString());
  ae.process(jcas);
  ae.getLogger().log(Level.INFO, new StringBuffer("Text processing completed").toString());
}
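
A minimal sketch of the simplification hinted at here: a constant message can be handed to the logger as a plain literal, and StringBuilder is only worth using when a string is really assembled piece by piece. To stay runnable without UIMA on the classpath, the sketch uses java.util.logging and a hypothetical TextEngine interface in place of UIMA's AnalysisEngine and JCas; those names, and EngineRunner itself, are assumptions for illustration and not part of the patch.

```java
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for UIMA's AnalysisEngine; not part of the patch.
interface TextEngine {
    void process(String text);
}

class EngineRunner {
    private static final Logger LOG = Logger.getLogger(EngineRunner.class.getName());

    // Logs plain literals: no StringBuffer or StringBuilder is needed
    // when the message is a constant.
    static void execute(TextEngine engine, String text) {
        LOG.log(Level.INFO, "Analyzing text");
        engine.process(text);
        LOG.log(Level.INFO, "Text processing completed");
    }

    // When a message is assembled step by step within one thread,
    // StringBuilder is the unsynchronized replacement for StringBuffer.
    static String describe(List<String> annotations) {
        StringBuilder sb = new StringBuilder("annotations:");
        for (String a : annotations) {
            sb.append(' ').append(a);
        }
        return sb.toString();
    }
}
```

StringBuffer synchronizes every call, which buys nothing for a local variable that never leaves the method.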


AEProviderFactory should be thread safe?? At a min, you have to consider multicore ... consider that you could be sharing AEProvider across threads because of this as well (static cache in AEProviderFactory). Perhaps the cache should not be static?
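
One way the non-static cache could look, as a rough sketch only: the cache becomes an instance field backed by a ConcurrentHashMap, so each owner (for example, each core) holds its own providers. ProviderCache, its get method, and the factory parameter are hypothetical names chosen for this example, not the patch's API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch; class and method names are illustrative only.
class ProviderCache<P> {
    // Instance field rather than static: every ProviderCache keeps its own
    // providers, even when two caches are asked for the same descriptor path.
    private final Map<String, P> providers = new ConcurrentHashMap<>();

    // computeIfAbsent is atomic on ConcurrentHashMap, so concurrent callers
    // asking for the same key receive the same provider instance.
    P get(String descriptorPath, Function<String, P> factory) {
        return providers.computeIfAbsent(descriptorPath, factory);
    }
}
```

With one ProviderCache per core, two cores that use the same descriptor path but different runtime parameters no longer end up sharing a provider.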


Don't want to at least log this?

} catch (AnalysisEngineProcessException e) {
// do nothing
}


Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA
-------------------------------------------------------------------------------

Key: SOLR-2129
URL: https://issues.apache.org/jira/browse/SOLR-2129
Project: Solr
Issue Type: New Feature
Reporter: Tommaso Teofili
Assignee: Robert Muir
Attachments: lib-jars.zip, SOLR-2129-asf-headers.patch, SOLR-2129-version2.patch, SOLR-2129-version3.patch, SOLR-2129.patch, SOLR-2129.patch


Provide components that make Apache UIMA's automatic metadata extraction available when indexing documents.
The purpose is to take the unstructured information "inside" a document and create structured metadata (as fields) that enriches it.
This can be done with a custom UpdateRequestProcessor that triggers UIMA while documents are being indexed.
The basic UIMA implementation of UpdateRequestProcessor extracts sentences (with a tokenizer and a hidden Markov model tagger), named entities, language, suggested category, keywords, and concepts (using external services from OpenCalais and AlchemyAPI). Such an implementation can easily be extended by adding or selecting different UIMA analysis engines, either taken from UIMA repositories on the web or created from scratch.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


  • Tommaso Teofili (JIRA) at Jan 4, 2011 at 9:06 am
    [ https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977177#action_12977177 ]

    Tommaso Teofili commented on SOLR-2129:
    ---------------------------------------

    bq. StringBuffer usage in UpdateRequestProcessor - should be StringBuilder right?

    yes, right.

    bq. private void executeAE(AnalysisEngine ae, JCas jcas) throws AnalysisEngineProcessException { ae.getLogger().log(Level.INFO, new StringBuffer("Analazying text").toString()); ae.process(jcas); ae.getLogger().log(Level.INFO, new StringBuffer("Text processing completed").toString()); }

    I wanted to logically isolate everything regarding actual processing of text, but I agree that this piece of code would look better inside the calling method ( processText(String) ).

    bq. AEProviderFactory should be thread safe?? At a min, you have to consider multicore ... consider that you could be sharing AEProvider across threads because of this as well (static cache in AEProviderFactory). Perhaps the cache should not be static?

    Thanks for this, Mark. I agree the cache shouldn't be static, especially in cases where each core has AEs with the same classpaths but different runtime parameters.
    As for OverridingParamsAEProvider (the only AEProvider implementation available at the moment) being used by different threads, we can make the getAE() method synchronized (or perhaps make the cachedAE field volatile, but I need to check that more carefully).
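
The synchronized option can be sketched as follows. LazyEngineProvider, getEngine, and the Supplier-based factory are hypothetical stand-ins for OverridingParamsAEProvider, getAE(), and the engine creation step; the real class may differ.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a provider such as OverridingParamsAEProvider;
// E is the engine type and the Supplier models the costly creation step.
class LazyEngineProvider<E> {
    private final Supplier<E> factory;
    private E cachedEngine; // guarded by "this"

    LazyEngineProvider(Supplier<E> factory) {
        this.factory = factory;
    }

    // synchronized makes the check-then-create step atomic, so the engine
    // is built at most once and is safely published to every caller.
    synchronized E getEngine() {
        if (cachedEngine == null) {
            cachedEngine = factory.get();
        }
        return cachedEngine;
    }
}
```

Marking the field volatile without any locking would make the cached engine visible to other threads, but two threads could still both see null and each create an engine, so volatile alone is enough only if duplicate creation is harmless.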

    bq. Don't want to at least log this? } catch (AnalysisEngineProcessException e) { // do nothing }

    I wanted the UIMA enrichment pipeline to be error safe but I agree it'd be reasonable to log the error in this case (even if I don't like logging exceptions in general).

  • Tommaso Teofili (JIRA) at Jan 4, 2011 at 11:02 am
    [ https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977226#action_12977226 ]

    Tommaso Teofili commented on SOLR-2129:
    ---------------------------------------

    Just forgot to say: I'll create a new patch from the above considerations :-)
  • Lance Norskog (JIRA) at Jan 9, 2011 at 4:22 am
    [ https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979264#action_12979264 ]

    Lance Norskog commented on SOLR-2129:
    -------------------------------------

    bq. Don't want to at least log this? } catch (AnalysisEngineProcessException e) { // do nothing }

    bq. I wanted the UIMA enrichment pipeline to be error safe but I agree it'd be reasonable to log the error in this case (even if I don't like logging exceptions in general).

    Please do not hide errors in any way. Nobody reads logs. If it fails in production, I want to know immediately and fix it. Please just throw all exceptions up the stack.
  • Tommaso Teofili (JIRA) at Jan 11, 2011 at 9:08 am
    [ https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980010#action_12980010 ]

    Tommaso Teofili commented on SOLR-2129:
    ---------------------------------------

    bq. Please do not hide errors in any way. Nobody reads logs. If it fails in production, I want to know immediately and fix it. Please just throw all exceptions up the stack.

    I think your point is a good one, Lance. When I started working on this patch I wanted to avoid breaking the indexing pipeline (as this was an "add-on"), but now that it's more stable I agree that any exception should be thrown.
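
A rough sketch of what throwing the exception up the stack could look like, in place of the empty catch block. EnrichmentException, Enricher, and IndexingStep are hypothetical stand-ins for AnalysisEngineProcessException, the UIMA step, and the update processor; RuntimeException is used only to keep the example free of dependencies, and in Solr itself a Solr-specific exception type would be the more likely wrapper.

```java
// Hypothetical stand-ins: EnrichmentException plays the role of
// AnalysisEngineProcessException, and Enricher the role of the UIMA step.
class EnrichmentException extends Exception {
    EnrichmentException(String message) {
        super(message);
    }
}

interface Enricher {
    void enrich(String text) throws EnrichmentException;
}

class IndexingStep {
    // Rethrows instead of swallowing: the failure reaches the caller with
    // the original exception preserved as the cause.
    static void processAdd(Enricher enricher, String text) {
        try {
            enricher.enrich(text);
        } catch (EnrichmentException e) {
            throw new RuntimeException("UIMA enrichment failed", e);
        }
    }
}
```

The trade-off is the one discussed in the thread: a failing analysis engine now stops the document from being indexed, instead of silently producing a document without the extra fields.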
  • Tommaso Teofili (JIRA) at Jan 24, 2011 at 6:55 am
    [ https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12985537#action_12985537 ]

    Tommaso Teofili commented on SOLR-2129:
    ---------------------------------------

    Thanks Robert for taking care :-)
    Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA
    -------------------------------------------------------------------------------

    Key: SOLR-2129
    URL: https://issues.apache.org/jira/browse/SOLR-2129
    Project: Solr
    Issue Type: New Feature
    Reporter: Tommaso Teofili
    Assignee: Robert Muir
    Fix For: 3.1, 4.0

    Attachments: lib-jars.zip, SOLR-2129-asf-headers.patch, SOLR-2129-version-5.patch, SOLR-2129-version-6.patch, SOLR-2129-version2.patch, SOLR-2129-version3.patch, SOLR-2129.patch, SOLR-2129.patch


    Provide components that make Apache UIMA's automatic metadata extraction available when indexing documents.
    The purpose is to take the unstructured information "inside" a document and create structured metadata (as fields) that enriches it.
    This can be done with a custom UpdateRequestProcessor that triggers UIMA while documents are being indexed.
    The basic UIMA implementation of UpdateRequestProcessor extracts sentences (with a tokenizer and a hidden Markov model tagger), named entities, language, suggested category, keywords, and concepts (using external services from OpenCalais and AlchemyAPI). Such an implementation can easily be extended by adding or selecting different UIMA analysis engines, either taken from UIMA repositories on the web or created from scratch.
    More information can be found on the dedicated wiki page: http://wiki.apache.org/solr/SolrUIMA

Discussion Overview
group: dev
category: lucene
posted: Jan 3, '11 at 6:37p
active: Jan 24, '11 at 6:55a
website: lucene.apache.org