This repository contains an implementation of a 5-gram Language Model trained on the works of Arthur Conan Doyle (specifically, a collection of Sherlock Holmes stories).
This project was built as an assignment to demonstrate N-gram language modeling and text generation in the style of a chosen author.
I chose Arthur Conan Doyle as the author for this language model for several reasons:
- Personal Favorite: Arthur Conan Doyle is my favorite author. I admire the way he constructs narratives that are both intellectually stimulating and deeply atmospheric.
- Engaging Style: The Sherlock Holmes stories are renowned for their suspense, intricate plotting, and the unique voice of Dr. Watson as the narrator. This makes the generated text interesting to analyze.
- Rich Vocabulary: Doyle's writing features a sophisticated and distinct vocabulary ("elementary", "deduction", "singular", "features"), which provides excellent training data for an N-gram model to capture stylistic nuances.
The project is structured as follows:

- `src/`: Contains the source code.
  - `model.py`: Defines the `NGramLanguageModel` class.
    - `preprocess(text)`: Cleans the input text (converts to lowercase, removes non-alphabetic characters except periods).
    - `train(text)`: Builds the N-gram probability model by counting the frequency of word sequences using a dictionary of dictionaries.
    - `generate(seed_text, max_words)`: Generates new text from a seed prompt. It uses the last $N-1$ words to predict the next word based on the learned probabilities.
  - `main.py`: The entry point of the application.
    - Loads the training data from `data/sherlock.txt`.
    - Initializes the `NGramLanguageModel` with $n=5$.
    - Trains the model on the corpus.
    - Generates text for specific sample prompts.
- `data/`: Contains the training corpus.
  - `sherlock.txt`: The text file containing the Sherlock Holmes stories used for training.
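Since `model.py` itself is not reproduced here, the following is only a minimal sketch of what such a class might look like. The method names (`preprocess`, `train`, `generate`) follow the description above; the specific regex and the weighted-sampling strategy are assumptions.

```python
import random
import re
from collections import defaultdict


class NGramLanguageModel:
    def __init__(self, n=5):
        self.n = n
        # context (tuple of n-1 words) -> {next_word: count}
        self.counts = defaultdict(lambda: defaultdict(int))

    def preprocess(self, text):
        # Lowercase and drop everything except letters, spaces, and periods.
        return re.sub(r"[^a-z. ]+", " ", text.lower())

    def train(self, text):
        words = self.preprocess(text).split()
        # Slide a window of n words; the first n-1 are the context,
        # the last one is the word being predicted.
        for i in range(len(words) - self.n + 1):
            context = tuple(words[i:i + self.n - 1])
            next_word = words[i + self.n - 1]
            self.counts[context][next_word] += 1

    def generate(self, seed_text, max_words=20):
        words = self.preprocess(seed_text).split()
        for _ in range(max_words):
            context = tuple(words[-(self.n - 1):])
            candidates = self.counts.get(context)
            if not candidates:
                break  # unseen context: no continuation is possible
            choices, weights = zip(*candidates.items())
            # Sample the next word in proportion to its observed frequency.
            words.append(random.choices(choices, weights=weights)[0])
        return " ".join(words)
```

Note that when a context has only ever been followed by one word in the corpus, generation is deterministic; the variety in the sample outputs below comes from contexts with multiple observed continuations.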
- Ensure you have Python installed.
- Navigate to the project root directory.
- Run the main script:
```
python src/main.py
```

Here are some example outputs generated by the model ($n=5$):
Sample 1
- Input: "the day was very"
- Output: "the day was very asked."
Sample 2
- Input: "it was a cold"
- Output: "it was a cold which you for then.he me.rder room brought shrank was"
Sample 3
- Input: "holmes looked at me"
- Output: "holmes looked at me with a singular expression on his face."
Note: The model uses a probabilistic approach, so outputs may vary slightly each time the code is run.
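If repeatable outputs are needed (for grading or debugging), the random number generator can be seeded before generation. This is a small illustration assuming the model samples with Python's standard `random` module, which is one common choice; it is not confirmed by the repository itself.

```python
import random

# With the same seed, weighted sampling produces identical draws
# on every run, so generated text becomes reproducible.
random.seed(42)
first = [random.choices(["a", "b", "c"], weights=[5, 3, 2])[0] for _ in range(5)]

random.seed(42)
second = [random.choices(["a", "b", "c"], weights=[5, 3, 2])[0] for _ in range(5)]

assert first == second
```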