Our situation is this: we process 200,000 XML files per day, and we have been storing them in zip files since 2001. I expect the application to consist of two parts:

Part 1: A Lucene indexer that indexes newly received files every 15 minutes, so that I can search all transactions received from 2001 up to the last 15 minutes.

Part 2: A front-end portal (MVC).
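To illustrate the input side of Part 1, here is a minimal sketch of how an indexer might pull XML documents out of one of the existing zip archives using only the standard `java.util.zip` classes. The class name `ZipXmlReader`, the `.xml` suffix filter, and the UTF-8 encoding are assumptions made for illustration; the real archives may use a different layout or encoding, and handing the content to Lucene is not shown.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: reads every .xml entry of a zip archive into memory.
final class ZipXmlReader {

    // Returns a map of entry name -> XML content, in archive order.
    // Assumption: entries are UTF-8 encoded and end in ".xml".
    static Map<String, String> readXmlEntries(InputStream zipStream) throws IOException {
        Map<String, String> result = new LinkedHashMap<>();
        try (ZipInputStream zin = new ZipInputStream(zipStream)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                String name = entry.getName();
                // Skip directories and anything that is not an XML file.
                if (entry.isDirectory() || !name.toLowerCase(Locale.ROOT).endsWith(".xml")) {
                    continue;
                }
                // Read the current entry fully; read() returns -1 at the end of the entry.
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = zin.read(chunk)) != -1) {
                    buffer.write(chunk, 0, n);
                }
                result.put(name, new String(buffer.toByteArray(), StandardCharsets.UTF_8));
            }
        }
        return result;
    }
}
```

At roughly 5 KB per file, holding one archive's entries in memory is likely acceptable, but a production indexer would probably stream each entry straight into the index instead of building a map.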
## Deliverables
We need to build a search portal (possibly on JBoss) that provides the following:

1) XPath-based search: I think you will need to index the entire XML documents (as DOM?) so that I can enter an XPath query and a value, and the portal returns the XML files (name and content) that match the criteria.
2) Capacity analysis: The average size of our XML files is 5 KB. I would like to know how much space the index will take up.
3) I also need an analysis showing whether we should store the XML files in an open source database, or whether storing them on the filesystem (as we currently do) is good enough.
4) Scalability: I would like the software to handle 50 concurrent users (hardware is not a problem) and respond in under 2 seconds.
5) Good-quality documentation: I will supply specs, sample XMLs, etc., and require the software to be well documented so that we can extend it in future.
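To make the XPath requirement in item 1 concrete, the sketch below shows the matching semantics I have in mind: given an XML document, an XPath expression, and a value, decide whether any selected node has that value. It uses only the standard `javax.xml.xpath` API and scans a single document, so it is not the indexed solution; at our volumes the real implementation would need Lucene (or similar) to avoid parsing every file per query. The class name `XPathMatcher`, the trimmed exact-text comparison, and the lack of namespace handling are assumptions for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: brute-force XPath + value match on one document.
final class XPathMatcher {

    // Returns true if any node selected by xpathExpr has text equal to expectedValue.
    // Assumption: the expression selects nodes (elements or attributes); expressions
    // that return numbers or booleans will raise an XPathExpressionException.
    static boolean matches(String xml, String xpathExpr, String expectedValue) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(xpathExpr, doc, XPathConstants.NODESET);

        for (int i = 0; i < nodes.getLength(); i++) {
            // getTextContent() yields element text or the attribute value.
            String text = nodes.item(i).getTextContent();
            if (text != null && expectedValue.equals(text.trim())) {
                return true;
            }
        }
        return false;
    }
}
```

For example, with a made-up document `<order><customer id="42"><name>Acme</name></customer></order>`, the query `/order/customer/@id` with value `42` should match, while `/order/customer/name` with value `Other` should not.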
## Platform
Java platform and open source software. I expect this application to run on Windows and HP-UX.