Your proposal could be hard to achieve, but I can propose the following plan. First of all, you are talking about documents that need to be parsed in some way. Unfortunately, PDF documents are designed to be printed, not parsed, so extracting information from them is fairly hard.
1. The first step could be converting all the PDF documents into another format (preferably plain text). This step should also be applied to the DOC files (a sketch of this conversion follows the list);
2. the second step could be running a word-count script on all the text files to obtain a mapping <file, word, occurrence>;
3. remove useless words (such as stop words) from this mapping;
4. create a table in which each row is a document and there is a column for each word (taking the union of all the extracted words), so that each cell contains the frequency of that specific word in that document (0 if the word does not appear in the document);
5. now you have a series of "points" in a multidimensional space, so you could potentially apply different clustering algorithms;
6. the idea is then to add new rows to the matrix, one per new document, and see which cluster each of them is assigned to (the documents in the same cluster are the most similar ones); steps 2-6 are sketched in the second snippet after the list.
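To give an idea of step 1, here is a minimal sketch that assumes the command-line tools pdftotext (from Poppler) and antiword are installed; the folder names are just placeholders:

```python
# Step 1 sketch: convert PDF and DOC files to plain text.
# Assumes the "pdftotext" (Poppler) and "antiword" command-line tools are installed;
# "documents" and "plain_text" are placeholder folder names.
import subprocess
from pathlib import Path

SRC = Path("documents")
DST = Path("plain_text")
DST.mkdir(exist_ok=True)

for doc in SRC.iterdir():
    out = DST / (doc.stem + ".txt")
    if doc.suffix.lower() == ".pdf":
        # pdftotext writes the extracted text directly to the output file
        subprocess.run(["pdftotext", str(doc), str(out)], check=True)
    elif doc.suffix.lower() == ".doc":
        # antiword prints the extracted text to stdout, so capture it and save it
        text = subprocess.run(["antiword", str(doc)], check=True,
                              capture_output=True, text=True).stdout
        out.write_text(text)
```

Any other extraction tool would work here; the only requirement is that every document ends up as a plain-text file.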
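Steps 2-6 can be sketched with scikit-learn, whose CountVectorizer builds exactly the word-frequency matrix described above (including English stop-word removal); the folder name, the new-document file name and the number of clusters are illustrative values, not fixed choices:

```python
# Steps 2-6 sketch: build the word-frequency matrix (stop words removed),
# cluster the documents, then assign a new document to one of the clusters.
from pathlib import Path
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

texts = [p.read_text(errors="ignore") for p in sorted(Path("plain_text").glob("*.txt"))]

# Steps 2-4: one row per document, one column per word, cells hold word frequencies
vectorizer = CountVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(texts)

# Step 5: cluster the documents (k = 5 is an arbitrary example value)
kmeans = KMeans(n_clusters=5, random_state=0)
kmeans.fit(matrix)

# Step 6: map a new document onto the same columns and see which cluster it falls into
new_doc = Path("plain_text/new_document.txt").read_text(errors="ignore")
new_row = vectorizer.transform([new_doc])
print("assigned cluster:", kmeans.predict(new_row)[0])
```

Using TF-IDF weights (TfidfVectorizer) instead of raw counts often gives better clusters, because it downweights words that appear in almost every document.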
If you already have some kind of class label for each document, you could also use a supervised classification algorithm to obtain the same (and maybe a more accurate) result.
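As a rough sketch of that supervised variant, assuming one known label per text file (the labels below are placeholders for your real classes), a simple Naive Bayes baseline would look like this:

```python
# Supervised variant sketch: same word-frequency matrix, plus one class label per
# document. The labels below are placeholders for the real classes you already have.
from pathlib import Path
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

files = sorted(Path("plain_text").glob("*.txt"))
texts = [p.read_text(errors="ignore") for p in files]
labels = ["class_a" if i % 2 == 0 else "class_b" for i in range(len(texts))]  # placeholders

vectorizer = CountVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(texts)
classifier = MultinomialNB().fit(matrix, labels)

new_doc = Path("plain_text/new_document.txt").read_text(errors="ignore")
print("predicted class:", classifier.predict(vectorizer.transform([new_doc]))[0])
```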
I can complete the task in a couple of days.
Best,
Fabio