I'm working on an app in which longish text chunks (could be up to a few MB in size, and stored either in flat text files or in fields of database records - TBD) need to be searched for the presence of a combination of string constants, where the string constants can be combined with boolean operators (in the user input given in a web form, for the search). The DB is MongoDB, accessed from Python using the PyMongo driver / client.
The text chunks are associated with structured fields of database records, of users (say, suppliers), so the end result wanted is that the app user should be able to specify a search using boolean operators and the app should search across all the text chunks belonging to all users (users to text chunks is a one-to-many relationship, whether the chunks are stored as long varchar in the DB or as flat files, with a field of a database record containing the filename of the text file containing the chunk). Then the app should return both the user names and the corresponding text chunks, for chunks that matched the boolean expression.
Example return value:
(where the filenames.txt could instead be the chunks of text, if the chunks are stored in the DB instead of the file system).
The boolean expressions can be like these:
apple and orange
apple or pear
apple and (pear or mango)
(apple and pear) or mango
apple or not (pear and orange)
What are some good approaches to doing this using Python?
I'm first looking at pure Python code approaches, though have considered using search tools like Solr (via PySolr), ElasticSearch, etc. Right now the app is in the prototype stage so I first want to establish the feasibility and a simple approach to doing the search, though later, if the app needs to scale, may investigate the tools like PySolr or ES.
I think I may need to use one of the parsing libraries for Python such as PLY and others mentioned here:
Let's say a parse tree is built up by whichever approach is used. A related question would be how to use this parse tree to search in the text chunks.
I realize that it is a somewhat complex questiom, so am not looking for complete solutions, but for suggestions on how to go about it.
Thanks for any suggestions.