Hello there, this is a really exciting project. Since you are new to the NLP domain, let me give you a skeleton approach so that the problem is easier to understand.
Approach:
Data Collection:
Use the provided text corpus (news articles, books, Wikipedia articles, and social media comments), which totals approximately 10 million lines of text.
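A corpus that mixes news, books, Wikipedia, and social media usually contains duplicates and very short lines, so a cleaning pass before annotation tends to help. Here is a minimal sketch of one, using only the standard library. The function name `clean_lines` and the `min_chars` threshold are my own assumptions, not something your corpus requires.

```python
import hashlib
import unicodedata
from typing import Iterable, Iterator


def clean_lines(lines: Iterable[str], min_chars: int = 20) -> Iterator[str]:
    """Yield normalized, de-duplicated lines from a raw text stream."""
    seen = set()  # digests of the lines already emitted
    for raw in lines:
        # NFC keeps letters such as "ə" and "ğ" in one canonical form.
        text = unicodedata.normalize("NFC", raw)
        text = " ".join(text.split())  # collapse runs of whitespace
        if len(text) < min_chars:
            continue  # assumed threshold: drop very short lines
        # Compare case-insensitively, storing a digest instead of the line.
        digest = hashlib.sha1(text.casefold().encode("utf-8")).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield text
```

Because it is a generator, you can feed it an open file handle and write the output line by line instead of loading the whole corpus at once.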
Annotated Data:
Manually annotate a subset of the corpus as training data, including question-answer pairs.
Define an annotation schema for Azerbaijani questions and answers.
Tag the key entities in the text so they can be used for Named Entity Recognition (NER).
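To make the annotation schema concrete, here is one possible shape for a single annotated record, sketched with dataclasses. The class names, field names, and the `LOC` label are hypothetical; your own schema may look quite different. The `validate` method checks that the character offsets really point at the annotated text, which catches a common class of annotation errors early.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EntitySpan:
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    label: str  # e.g. "LOC" (hypothetical label set)


@dataclass
class QAExample:
    context: str
    question: str
    answer_text: str
    answer_start: int  # character offset of the answer inside context
    entities: List[EntitySpan] = field(default_factory=list)

    def validate(self) -> None:
        """Raise ValueError if any offset does not match the context."""
        end = self.answer_start + len(self.answer_text)
        if self.answer_start < 0 or self.context[self.answer_start:end] != self.answer_text:
            raise ValueError("answer_start does not point at answer_text")
        for ent in self.entities:
            if not (0 <= ent.start < ent.end <= len(self.context)):
                raise ValueError(f"entity span out of range: {ent}")


example = QAExample(
    context="Bakı Azərbaycanın paytaxtıdır.",
    question="Azərbaycanın paytaxtı haradır?",
    answer_text="Bakı",
    answer_start=0,
    entities=[EntitySpan(0, 4, "LOC")],
)
example.validate()  # passes silently when the offsets are consistent
```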
Model Architecture:
Build a custom language model on spaCy's infrastructure, using transfer learning from a base model where one is available (e.g., a model for a closely related language such as Turkish, or a multilingual Transformer).
Train the model on the annotated data with a deep learning architecture such as an LSTM or a Transformer-based model like BERT.
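Before training with spaCy v3, the annotated data is usually converted to spaCy's binary `DocBin` format. Here is a rough sketch of that step for NER annotations. The sample data and the helper name `build_docbin` are made up for illustration. I use the multi-language code `"xx"` as a safe default; switch to `"az"` if your spaCy version ships Azerbaijani language data.

```python
import spacy
from spacy.tokens import DocBin

# Hypothetical annotated sample: (text, [(start_char, end_char, label), ...]).
TRAIN_DATA = [
    ("Bakı Azərbaycanın paytaxtıdır.", [(0, 4, "LOC")]),
]


def build_docbin(samples, lang="xx"):
    """Convert character-offset annotations into a spaCy DocBin."""
    nlp = spacy.blank(lang)  # tokenizer only, no trained components
    doc_bin = DocBin()
    for text, annotations in samples:
        doc = nlp.make_doc(text)
        spans = []
        for start, end, label in annotations:
            span = doc.char_span(start, end, label=label)
            if span is None:
                continue  # offsets do not align with token boundaries
            spans.append(span)
        doc.ents = spans
        doc_bin.add(doc)
    return doc_bin


doc_bin = build_docbin(TRAIN_DATA)
```

From there, `doc_bin.to_disk("train.spacy")` writes a file that `python -m spacy train` can read through the `--paths.train` option.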
Evaluation:
Measure performance with metrics suited to each task: precision, recall, and F1-score for NER and extractive question answering, and BLEU or ROUGE if the model generates free-text answers.
Fine-tune the model iteratively based on evaluation results.
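For question answering, F1 is commonly computed over the tokens shared between the predicted and the reference answer, alongside an exact-match check. Here is a simplified sketch of both. Splitting on whitespace is an assumption on my part; a real evaluation would probably also strip punctuation and might use a proper Azerbaijani tokenizer.

```python
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Case-fold and split on whitespace (a deliberate simplification)."""
    return text.casefold().split()


def exact_match(prediction: str, reference: str) -> bool:
    """True when both answers have the same normalized tokens."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; only one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Averaging `token_f1` and `exact_match` over a held-out set after each training round gives you a number to track while you iterate.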
Integration with spaCy:
Develop a spaCy pipeline component for Azerbaijani language support.
Enable the model for NLP tasks such as Named Entity Recognition (NER) and part-of-speech tagging. Question answering is not a built-in spaCy component, so it would need a custom component.
Ensure compatibility with spaCy's existing features and libraries.
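To give a feel for what a custom pipeline component looks like, here is a small sketch that registers a component with `Language.component` and adds it to a blank pipeline. The component name `question_flagger` and the `is_question` attribute are hypothetical, and the logic is a trivial placeholder where a real question-answering step would go. As before, `"xx"` is the multi-language fallback; use `"az"` if your spaCy version includes it.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Custom document attribute; "is_question" is a hypothetical name.
if not Doc.has_extension("is_question"):
    Doc.set_extension("is_question", default=False)


@Language.component("question_flagger")
def question_flagger(doc: Doc) -> Doc:
    """Flag documents whose text ends with a question mark."""
    doc._.is_question = doc.text.rstrip().endswith("?")
    return doc


nlp = spacy.blank("xx")  # blank pipeline: tokenizer only
nlp.add_pipe("question_flagger")

doc = nlp("Azərbaycanın paytaxtı haradır?")
print(doc._.is_question)
```

Because the component takes a `Doc` and returns a `Doc`, it can sit next to spaCy's own components (tagger, NER, and so on) in the same pipeline.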