Ahmed Al-Obaidy contacted me about some ideas for improving Lucene's
Arabic support. The background is that we use a very simple stemming
algorithm that doesn't really attempt to handle the nuances of the
Arabic language, especially things like broken plurals (roughly half
of all Arabic noun plurals are not stemmed in a useful way).

He is researching some solutions to these problems, and is considering
building an open source Arabic language test collection, using
articles from Wikipedia as the corpus. I know Wikipedia has its own
weird properties sometimes, but so does news text or anything else
people use for these purposes, so I think it's just fine myself.

He asked me a few questions about building judgements and things like
that, so I asked his permission to move the discussion to this list,
in case anyone has suggestions to offer.

An open source Arabic test collection would, in my opinion, be pretty
beneficial to this project; it's something I personally would love to

Ahmed, feel free to fill in with more details if you want.