[jira] Created: (SOLR-2129) Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA

Tommaso Teofili (JIRA)
Sep 22, 2010 at 5:56 am
Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA
-------------------------------------------------------------------------------

Key: SOLR-2129
URL: https://issues.apache.org/jira/browse/SOLR-2129
Project: Solr
Issue Type: New Feature
Reporter: Tommaso Teofili


Provide components to enable Apache UIMA automatic metadata extraction to be exploited when indexing documents.
The purpose of this is to get unstructured information "inside" a document and create structured metadata (as fields) to enrich each document.

Basically this can be done with a custom UpdateRequestProcessor which triggers UIMA while indexing documents.
The basic UIMA implementation of UpdateRequestProcessor extracts sentences (with a tokenizer and a hidden Markov model tagger), named entities, language, suggested category, keywords and concepts (exploiting external services from OpenCalais and AlchemyAPI). Such an implementation can easily be extended by adding or selecting different UIMA analysis engines, either taken from UIMA repositories on the web or created from scratch.
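
As a rough illustration of the idea described above, the following Java sketch shows an update processor that runs an analyzer over a text field and adds the extracted metadata as new fields before handing the document to the next processor. It is not the code from the patch: Doc, Processor, Analyzer and EnrichingProcessor are hypothetical stand-ins written for this example (in Solr the corresponding extension point is UpdateRequestProcessor), and the toy analyzer in main is an assumption that replaces the real UIMA analysis engines.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class UimaEnrichmentSketch {

    /** Hypothetical stand-in for a document being indexed: field name -> values. */
    static final class Doc {
        final Map<String, List<String>> fields = new LinkedHashMap<>();

        void add(String field, String value) {
            fields.computeIfAbsent(field, k -> new ArrayList<>()).add(value);
        }
    }

    /** Hypothetical stand-in for one step of an update chain. */
    interface Processor {
        void processAdd(Doc doc);
    }

    /** Hypothetical stand-in for an analysis engine: text in, metadata fields out. */
    interface Analyzer {
        Map<String, List<String>> analyze(String text);
    }

    /** Enriches each document with extracted metadata, then passes it down the chain. */
    static final class EnrichingProcessor implements Processor {
        private final String textField;
        private final Analyzer analyzer;
        private final Processor next; // may be null at the end of the chain

        EnrichingProcessor(String textField, Analyzer analyzer, Processor next) {
            this.textField = textField;
            this.analyzer = analyzer;
            this.next = next;
        }

        @Override
        public void processAdd(Doc doc) {
            List<String> texts = doc.fields.getOrDefault(textField, Collections.emptyList());
            // Iterate over a copy so the analyzer may also write to the text field.
            for (String text : new ArrayList<>(texts)) {
                analyzer.analyze(text).forEach(
                        (field, values) -> values.forEach(value -> doc.add(field, value)));
            }
            if (next != null) {
                next.processAdd(doc);
            }
        }
    }

    public static void main(String[] args) {
        // Toy analyzer (an assumption): tags every text as English.
        Analyzer toy = text ->
                Collections.singletonMap("language", Collections.singletonList("en"));
        Doc doc = new Doc();
        doc.add("text", "Apache UIMA meets Solr");
        // The last processor in this toy chain simply prints the enriched fields.
        new EnrichingProcessor("text", toy, d -> System.out.println(d.fields)).processAdd(doc);
    }
}
```

Keeping the analyzer behind a small interface mirrors the extensibility point made above: swapping in a different analysis engine would not require touching the processor itself.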

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

19 responses

  • Tommaso Teofili (JIRA) at Sep 24, 2010 at 12:40 pm
    [ https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Tommaso Teofili updated SOLR-2129:
    ----------------------------------

    Attachment: SOLR-2129.patch

    Patch to port the solr-uima Google Code (GC) project as a solr/contrib module
  • Tommaso Teofili (JIRA) at Sep 24, 2010 at 1:10 pm
    [ https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Tommaso Teofili updated SOLR-2129:
    ----------------------------------

    Attachment: SOLR-2129-asf-headers.patch

    Same patch, plus the required ASF headers on code and XML files
  • Robert Muir (JIRA) at Sep 24, 2010 at 5:27 pm
    [ https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914548#action_12914548 ]

    Robert Muir commented on SOLR-2129:
    -----------------------------------

    Hello, could you upload the jar files that this patch depends on to this issue?

    I tried to fetch them to test the patch, but I think there are problems in maven-land with Alchemy:

    http://repository.apache.org/snapshots/org/apache/uima/alchemy-annotator/2.3.1-SNAPSHOT/

    As you can see, the jar file is very out of date.

  • Tommaso Teofili (JIRA) at Sep 25, 2010 at 9:19 am
    [ https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Tommaso Teofili updated SOLR-2129:
    ----------------------------------

    Attachment: lib-jars.zip

    Hello Robert, attached you can find an archive containing all the lib/*.jar files
  • Robert Muir (JIRA) at Sep 25, 2010 at 11:57 am
    [ https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914809#action_12914809 ]

    Robert Muir commented on SOLR-2129:
    -----------------------------------

    Thanks Tommaso!

    I applied the patch: the build and tests work correctly, there aren't any intl/localization issues, and the code looks clean.

    Would another committer more familiar with these parts of Solr take a look? It looks like a good feature.

  • Tommaso Teofili (JIRA) at Oct 4, 2010 at 3:10 pm
    [ https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917634#action_12917634 ]

    Tommaso Teofili commented on SOLR-2129:
    ---------------------------------------

    Hello Robert,
    since this patch hasn't been committed yet, I wonder whether there is anything I should do or could help with.
    If so, please let me know :)

  • Robert Muir (JIRA) at Oct 5, 2010 at 12:43 pm
    [ https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917957#action_12917957 ]

    Robert Muir commented on SOLR-2129:
    -----------------------------------

    Hi Tommaso: I was hoping to get another person to look at it, since it is not my area of expertise.

    But no one is stepping up, so I will take it. It will take me longer to review it though (sorry)!

  • Robert Muir (JIRA) at Oct 5, 2010 at 12:44 pm
    [ https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Robert Muir reassigned SOLR-2129:
    ---------------------------------

    Assignee: Robert Muir
  • Robert Muir (JIRA) at Oct 5, 2010 at 1:05 pm
    [ https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917961#action_12917961 ]

    Robert Muir commented on SOLR-2129:
    -----------------------------------

    Tommaso: I noticed the following in the maven configuration:

    {noformat}
    <source>1.6</source>
    <target>1.6</target>
    {noformat}

    But I took the patch, applied it to branch_3x (Java 5 only), removed three @Override annotations on interface methods, and everything worked with Java 5.
    Can you confirm this is correct (that UIMA does not require Java 6)?

    If the patch only needs Java 5, then it can be applied to our 3.x branch as well.

  • Jörn Kottmann (JIRA) at Oct 5, 2010 at 1:53 pm
    [ https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917981#action_12917981 ]

    Jörn Kottmann commented on SOLR-2129:
    -------------------------------------

    I am also interested in using this patch. Is it possible to run custom UIMA analysis, or only the predefined AlchemyAPI analysis?
  • Tommaso Teofili (JIRA) at Oct 5, 2010 at 3:58 pm
    [ https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12918037#action_12918037 ]

    Tommaso Teofili commented on SOLR-2129:
    ---------------------------------------

    Robert, thanks for that. I confirm that UIMA doesn't require Java 6; Java 5 is fine, so this works for branch_3x too.

    Jörn, good to see you here too :) You can also run custom UIMA analysis.
    The default AEs are WhitespaceTokenizer, Tagger, AlchemyAPIAnnotator, and OpenCalaisAnnotator.


    To customize the default behavior you should either:
    a) change the OverridingParamsExtServicesAEDescriptor and, if needed, extend BaseUIMAUpdateRequestProcessor and its SolrUIMAConsumers

    or

    b) define a new AE descriptor and create a new class for it that extends UIMAUpdateRequestProcessor (or BaseUIMAUpdateRequestProcessor), then modify the UIMAUpdateRequestProcessorFactory to initialize that class instead of the base one.


    If you need any parameters to be set at runtime for a delegate AE, you must define, inside the aggregate AE, an overriding parameter that overrides the corresponding parameter in the delegate AE, and then define its runtime value in solrconfig with:

    <uimaConfig>
      <runtimeParameters>
        <overriding_param_name>RUNTIMEVALUE</overriding_param_name>
      </runtimeParameters>
    </uimaConfig>
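
The overriding-parameter mechanism described above can be pictured with a small, self-contained Java model. This is only a sketch under stated assumptions, not UIMA or solr-uima code: the Delegate and Aggregate classes, the "delegateKey/parameterName" target strings, and the names SomeAnnotator and apikey are all hypothetical and chosen for illustration. It shows a runtime value, keyed by the overriding parameter name as in the runtimeParameters element, being pushed down to the delegate parameter it overrides.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

final class OverridingParamsSketch {

    /** Hypothetical model of a delegate AE: just its parameter values. */
    static final class Delegate {
        final Map<String, String> params = new LinkedHashMap<>();
    }

    /** Hypothetical model of an aggregate AE with overriding parameters. */
    static final class Aggregate {
        final Map<String, Delegate> delegates = new LinkedHashMap<>();
        // overriding parameter name -> "delegateKey/parameterName"
        final Map<String, String> overrides = new LinkedHashMap<>();

        /** Pushes runtime values (as read from runtimeParameters) down to the delegates. */
        void applyRuntimeParameters(Map<String, String> runtimeParameters) {
            for (Map.Entry<String, String> entry : runtimeParameters.entrySet()) {
                String target = overrides.get(entry.getKey());
                if (target == null) {
                    throw new IllegalArgumentException(
                            "No overriding parameter named " + entry.getKey());
                }
                String[] parts = target.split("/", 2);
                Delegate delegate = parts.length == 2 ? delegates.get(parts[0]) : null;
                if (delegate == null) {
                    throw new IllegalStateException("Bad override target " + target);
                }
                delegate.params.put(parts[1], entry.getValue());
            }
        }
    }

    public static void main(String[] args) {
        Delegate annotator = new Delegate();
        annotator.params.put("apikey", "PLACEHOLDER"); // hypothetical parameter name

        Aggregate aggregate = new Aggregate();
        aggregate.delegates.put("SomeAnnotator", annotator);
        aggregate.overrides.put("overriding_param_name", "SomeAnnotator/apikey");

        // Equivalent of <overriding_param_name>RUNTIMEVALUE</overriding_param_name>.
        aggregate.applyRuntimeParameters(
                Collections.singletonMap("overriding_param_name", "RUNTIMEVALUE"));
        System.out.println(annotator.params);
    }
}
```

Rejecting unknown names early, as this model does, is one possible design choice; it makes a typo in the configuration fail at startup rather than silently leaving the delegate with its default value.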



  • Mark Miller (JIRA) at Oct 5, 2010 at 4:07 pm
    [ https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12918041#action_12918041 ]

    Mark Miller commented on SOLR-2129:
    -----------------------------------

    I'm going to take a look at this when I get a chance as well. This looks like solid stuff.
  • Grant Ingersoll (JIRA) at Oct 25, 2010 at 7:42 pm
    [ https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924691#action_12924691 ]

    Grant Ingersoll commented on SOLR-2129:
    ---------------------------------------

    Cool stuff, Tommaso. I'm starting to look at adding classifiers into Solr via Mahout, so thought I would look at this too.

    Couple of early things, based on looking at the getting started instructions.

    # I think we should do as we do with Tika and provide a way for users to map UIMA output to Solr fields, as opposed to having to hardcode specific fields.
    # For the jars, have a look at how the clustering is set up. We should be able to just point at the UIMA libs under contrib/uima/lib from solrconfig.xml instead of having to copy them around.


  • Tommaso Teofili (JIRA) at Oct 26, 2010 at 2:04 pm
    [ https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924971#action_12924971 ]

    Tommaso Teofili commented on SOLR-2129:
    ---------------------------------------

    Hi Grant, I think it would be great to have Mahout classifiers inside Solr :)

    I like your suggestion at point 1.
    I can replace the current hardcoded mapping mechanism with a simple mapping between UIMA-extracted types/features and field names, defined inside solrconfig.xml.

    A different option could be to develop a SolrCASConsumer component in UIMA (similar to Lucas [1], the Lucene CAS Consumer), providing full control over how UIMA annotations and features are mapped to Solr fields, but on the UIMA side ;)

    Regarding point 2, the jars are already under contrib/uima/lib, so I can modify the sample solrconfig.xml to add the proper <lib> tag.
    Thanks for your comments and suggestions.

    [1] : https://svn.apache.org/repos/asf/uima/sandbox/trunk/Lucas
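
The configurable mapping discussed under point 1 could look roughly like the following Java sketch. It is a minimal illustration under stated assumptions, not the mechanism that the patch implements: the FieldMappingSketch class, its Annotation type, and the example type, feature and field names are hypothetical. It shows a table from UIMA type and feature names to Solr field names being applied to extracted annotations, with unmapped types and features skipped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FieldMappingSketch {

    /** Hypothetical extracted annotation: a type name plus feature name -> value. */
    static final class Annotation {
        final String type;
        final Map<String, String> features;

        Annotation(String type, Map<String, String> features) {
            this.type = type;
            this.features = features;
        }
    }

    // type name -> (feature name -> Solr field name), as it might be read from configuration
    private final Map<String, Map<String, String>> mappings = new LinkedHashMap<>();

    /** Registers one mapping entry and returns this object for chaining. */
    FieldMappingSketch map(String type, String feature, String field) {
        mappings.computeIfAbsent(type, k -> new LinkedHashMap<>()).put(feature, field);
        return this;
    }

    /** Returns Solr field name -> values for every mapped feature; the rest is skipped. */
    Map<String, List<String>> toFields(List<Annotation> annotations) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        for (Annotation annotation : annotations) {
            Map<String, String> byFeature = mappings.get(annotation.type);
            if (byFeature == null) {
                continue; // no mapping configured for this type
            }
            for (Map.Entry<String, String> feature : annotation.features.entrySet()) {
                String field = byFeature.get(feature.getKey());
                if (field != null) {
                    fields.computeIfAbsent(field, k -> new ArrayList<>())
                            .add(feature.getValue());
                }
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical type, feature and field names.
        FieldMappingSketch mapping = new FieldMappingSketch()
                .map("example.Entity", "name", "entity")
                .map("example.Language", "code", "language");
        List<Annotation> extracted = Arrays.asList(
                new Annotation("example.Entity", Collections.singletonMap("name", "Solr")),
                new Annotation("example.Language", Collections.singletonMap("code", "en")));
        System.out.println(mapping.toFields(extracted));
    }
}
```

With a table like this, changing which annotations end up in which fields becomes a configuration change rather than a code change, which is the point of the suggestion.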
  • Grant Ingersoll (JIRA) at Oct 26, 2010 at 6:06 pm
    [ https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925068#action_12925068 ]

    Grant Ingersoll commented on SOLR-2129:
    ---------------------------------------

    bq. I can change the current hardcoded mapping mechanism using instead a simple mapping between UIMA extracted types/features and field names defined inside solrconfig.xml

    Try to reuse the same syntax as the mapping in the ExtractingRequestHandler.

    bq. A different option could be to develop a SolrCASConsumer component in UIMA (similar to Lucas [1], Lucene CAS Consumer) providing full control on how UIMA annotations and features can be mapped to Solr fields, but on UIMA side

    I've been struggling with these kinds of questions a lot lately. That is, the marriage of two projects. Where should the code go? Setting up another ASF project is a pain in the amount of hoops to jump through. Apache Labs doesn't cut it for a number of reasons. Hosting on Github or Google Code is OK, but loses the ASF community aspect. Sigh.

    bq. Regarding point 2 the jars are already under contrib/uima/lib so I can modify the sample solrconfig.xml adding the proper <lib> tag.

    Yep, exactly what I had in mind.
  • Tommaso Teofili (JIRA) at Oct 27, 2010 at 1:52 pm
    [ https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925371#action_12925371 ]

    Tommaso Teofili commented on SOLR-2129:
    ---------------------------------------

    bq. Try to reuse the same syntax as the mapping in the ExtractingRequestHandler.

    OK, I added the <lib> tag and will commit a new patch when I'm finished with these changes.

    bq. I've been struggling with these kinds of questions a lot lately. That is, the marriage of two projects. Where should the code go? Setting up another ASF project is a pain in the amount of hoops to jump through. Apache Labs doesn't cut it for a number of reasons. Hosting on Github or Google Code is OK, but loses the ASF community aspect. Sigh.

    I agree with your point; I don't think it's easy to come up with a good, general, final answer for such situations.

    What comes to mind as a general solution is establishing a single wide-purpose ASF project that contains integrations between many different ASF projects. This could be a good base for two projects that want to "marry", but it could be too general and perhaps not easy to maintain from a community point of view (e.g., should all the Lucene committers also commit to the "integrations" project only because someone integrated it with UIMA?). Another option could be to force the two marrying projects to respect a standard (e.g., CMIS) so that developing a specialized "connector" would no longer be needed, but I don't think that is always possible since it could require a huge effort.

    In this particular case, in my opinion, the code should go into the project whose "pipeline" is being changed or enhanced. In this Solr-UIMA integration we're adding a step to the Solr indexing process via an UpdateRequestProcessor, so I think it should be part of the Solr codebase. In the SolrCASConsumer, by contrast, we'd be adding a (final) Consumer to the UIMA pipeline, so that should be part of the UIMA codebase.
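
    To make the "which pipeline is being extended" argument concrete, here is a minimal sketch of an indexing chain in which an enrichment step runs before the document reaches the index. It only mirrors the idea of an UpdateRequestProcessor chain; the class names (Processor, EnrichingProcessor, IndexWriterStub) and the fake_analysis function are hypothetical stand-ins, not Solr or UIMA classes.

```python
class Processor:
    """One step in an indexing chain; forwards the document to the next step."""

    def __init__(self, next_processor=None):
        self.next = next_processor

    def process_add(self, doc):
        if self.next is not None:
            self.next.process_add(doc)


class EnrichingProcessor(Processor):
    """Adds extracted metadata as extra fields, then continues the chain."""

    def __init__(self, analyze, next_processor=None):
        super().__init__(next_processor)
        self.analyze = analyze  # callable: text -> {field: [values]}

    def process_add(self, doc):
        for field, values in self.analyze(doc["text"]).items():
            doc.setdefault(field, []).extend(values)
        super().process_add(doc)


class IndexWriterStub(Processor):
    """Final step: records documents instead of writing a real index."""

    def __init__(self):
        super().__init__()
        self.indexed = []

    def process_add(self, doc):
        self.indexed.append(doc)


def fake_analysis(text):
    # Placeholder for a real analysis engine; always reports English.
    return {"language": ["en"]} if text else {}


writer = IndexWriterStub()
chain = EnrichingProcessor(fake_analysis, writer)
chain.process_add({"id": "1", "text": "hello"})
```

    The enrichment step lives inside the indexing chain, which is the reason given above for keeping it on the Solr side; a CAS consumer would instead be the last step of a chain owned by UIMA.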

  • Tommaso Teofili (JIRA) at Nov 5, 2010 at 7:53 am
    [ https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928531#action_12928531 ]

    Tommaso Teofili commented on SOLR-2129:
    ---------------------------------------

    bq. Try to reuse the same syntax as the mapping in the ExtractingRequestHandler.

    There are many possible ways to define the configuration inside <uimaConfig>.
    Let's say we want to map the feature 'text' of type 'ConceptFS' to the field 'concept'. I thought of three options, listed here:

    1. Exactly the same syntax as ExtractingRequestHandler, though Solr-UIMA is not a RequestHandler but an UpdateRequestProcessor; could this create confusion?
    <lst name="defaults">
      <str name="fmap.org.apache.uima.alchemy.ts.categorization.ConceptFS@text">concept</str>
    </lst>

    2. Define the feature of a type to map to a field with a single tag:
    <map field="concept" feature="org.apache.uima.alchemy.ts.categorization.ConceptFS@text"/>

    3. A more hierarchical and strict structure, though less immediate to understand and perhaps easier for UIMA experts:
    <type name="org.apache.uima.alchemy.ts.categorization.ConceptFS">
      <feature name="text">concept</feature>
    </type>
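
    As an illustration only (this is not code from the patch), the three candidate syntaxes above can be read as carrying the same information: a (type, feature) pair mapped to a field name. The sketch below parses each variant with Python's standard xml.etree.ElementTree. The parse_option_* helper names are hypothetical, and the actual module would presumably do the equivalent in Java.

```python
import xml.etree.ElementTree as ET

TYPE = "org.apache.uima.alchemy.ts.categorization.ConceptFS"

OPTION_1 = """
<lst name="defaults">
  <str name="fmap.org.apache.uima.alchemy.ts.categorization.ConceptFS@text">concept</str>
</lst>
"""

OPTION_2 = """
<map field="concept" feature="org.apache.uima.alchemy.ts.categorization.ConceptFS@text"/>
"""

OPTION_3 = """
<type name="org.apache.uima.alchemy.ts.categorization.ConceptFS">
  <feature name="text">concept</feature>
</type>
"""


def parse_option_1(xml_text):
    """Read fmap.<type>@<feature> entries from a <lst> element."""
    mapping = {}
    for entry in ET.fromstring(xml_text).findall("str"):
        name = entry.get("name", "")
        if name.startswith("fmap."):
            type_name, feature = name[len("fmap."):].split("@", 1)
            mapping[(type_name, feature)] = entry.text.strip()
    return mapping


def parse_option_2(xml_text):
    """Read a single <map field=... feature="<type>@<feature>"/> element."""
    element = ET.fromstring(xml_text)
    type_name, feature = element.get("feature").split("@", 1)
    return {(type_name, feature): element.get("field")}


def parse_option_3(xml_text):
    """Read <feature> children nested under a <type> element."""
    type_element = ET.fromstring(xml_text)
    return {
        (type_element.get("name"), f.get("name")): f.text.strip()
        for f in type_element.findall("feature")
    }


expected = {(TYPE, "text"): "concept"}
all_equal = (
    parse_option_1(OPTION_1) == expected
    and parse_option_2(OPTION_2) == expected
    and parse_option_3(OPTION_3) == expected
)
```

    Since all three forms reduce to the same lookup table, the choice between them is mostly about readability and consistency with existing Solr configuration, not about expressiveness.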

    What do you think?
    Thanks for any advice,
    Tommaso
  • Tommaso Teofili (JIRA) at Nov 10, 2010 at 9:59 am
    [ https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930562#action_12930562 ]

    Tommaso Teofili commented on SOLR-2129:
    ---------------------------------------

    I think I found a good compromise:

    <type name="org.apache.uima.jcas.tcas.Annotation">
      <map feature="coveredText" field="tag"/>
    </type>

    I've also made the fields to analyze and the analysis engine path configurable (in solrconfig.xml).
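
    A rough sketch of how such a mapping might be applied at indexing time, under stated assumptions: annotations are modelled as plain (type name, feature dictionary) pairs instead of a UIMA CAS, documents are plain dictionaries of multi-valued fields, and load_mapping and enrich are hypothetical names, not part of the patch.

```python
import xml.etree.ElementTree as ET

CONFIG = """
<type name="org.apache.uima.jcas.tcas.Annotation">
  <map feature="coveredText" field="tag"/>
</type>
"""


def load_mapping(xml_text):
    """Turn one <type> element into {(type, feature): field}."""
    type_element = ET.fromstring(xml_text)
    return {
        (type_element.get("name"), m.get("feature")): m.get("field")
        for m in type_element.findall("map")
    }


def enrich(document, annotations, mapping):
    """Return a copy of the document with mapped feature values added.

    annotations is a list of (type_name, {feature: value}) pairs, an
    assumed stand-in for what an analysis engine would produce.
    """
    enriched = {field: list(values) for field, values in document.items()}
    for type_name, features in annotations:
        for feature, value in features.items():
            field = mapping.get((type_name, feature))
            if field is not None:  # unmapped features are ignored
                enriched.setdefault(field, []).append(value)
    return enriched


mapping = load_mapping(CONFIG)
annotations = [
    ("org.apache.uima.jcas.tcas.Annotation", {"coveredText": "Apache UIMA", "begin": 0}),
    ("org.apache.uima.jcas.tcas.Annotation", {"coveredText": "Solr", "begin": 18}),
]
result = enrich({"text": ["Apache UIMA meets Solr"]}, annotations, mapping)
```

    Features without a mapping entry (here, begin) are simply skipped, so the configuration acts as a whitelist of what ends up in the index.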
  • Tommaso Teofili (JIRA) at Nov 30, 2010 at 9:48 am
    [ https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965175#action_12965175 ]

    Tommaso Teofili commented on SOLR-2129:
    ---------------------------------------

    Hi all, if anyone has had a chance to try the latest patch, please let me know your feedback.

    Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA
    -------------------------------------------------------------------------------

    Key: SOLR-2129
    URL: https://issues.apache.org/jira/browse/SOLR-2129
    Project: Solr
    Issue Type: New Feature
    Reporter: Tommaso Teofili
    Assignee: Robert Muir
    Attachments: lib-jars.zip, SOLR-2129-asf-headers.patch, SOLR-2129-version2.patch, SOLR-2129.patch


