
Sherlock Holmes 5-gram Language Model

This repository contains an implementation of a 5-gram language model trained on the works of Arthur Conan Doyle (specifically, a collection of Sherlock Holmes stories).

This project was built as an assignment to demonstrate N-gram language modeling and text generation in the style of a chosen author.

Author Choice: Arthur Conan Doyle

I have chosen Arthur Conan Doyle as the author for this language model.

Why Arthur Conan Doyle?

  1. Personal Favorite: Arthur Conan Doyle is my favorite author. I admire the way he constructs narratives that are both intellectually stimulating and deeply atmospheric.
  2. Engaging Style: The Sherlock Holmes stories are renowned for their suspense, intricate plotting, and the unique voice of Dr. Watson as the narrator. This makes the generated text interesting to analyze.
  3. Rich Vocabulary: Doyle's writing features a sophisticated and distinct vocabulary ("elementary", "deduction", "singular", "features"), which provides excellent training data for an N-gram model to capture stylistic nuances.

Codebase Overview

The project is structured as follows:

  • src/: Contains the source code.
    • model.py: Defines the NGramLanguageModel class.
      • preprocess(text): Cleans the input text (converts to lowercase, removes non-alphabetic characters except periods).
      • train(text): Builds the N-gram probability model by counting the frequency of word sequences using a dictionary of dictionaries.
      • generate(seed_text, max_words): Generates new text based on a seed prompt. It uses the last $N-1$ words to predict the next word based on the learned probabilities.
    • main.py: The entry point of the application.
      • Loads the training data from data/sherlock.txt.
      • Initializes the NGramLanguageModel with $n=5$.
      • Trains the model on the corpus.
      • Generates text for specific sample prompts.
  • data/: Contains the training corpus.
    • sherlock.txt: The text file containing the Sherlock Holmes stories used for training.
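The `NGramLanguageModel` class described above can be sketched roughly as follows. This is a minimal, illustrative reconstruction based only on the README's description (dictionary-of-dictionaries counts, lowercase preprocessing that keeps periods, generation from the last $N-1$ words); the actual `src/model.py` may differ in details.

```python
import random
import re
from collections import defaultdict

class NGramLanguageModel:
    """Sketch of the model described above (interface assumed from the README)."""

    def __init__(self, n=5):
        self.n = n
        # dictionary of dictionaries: (n-1)-word context -> {next_word: count}
        self.counts = defaultdict(lambda: defaultdict(int))

    def preprocess(self, text):
        # lowercase; keep only letters, spaces, and periods
        return re.sub(r"[^a-z. ]+", " ", text.lower())

    def train(self, text):
        words = self.preprocess(text).split()
        for i in range(len(words) - self.n + 1):
            context = tuple(words[i:i + self.n - 1])
            self.counts[context][words[i + self.n - 1]] += 1

    def generate(self, seed_text, max_words=20):
        words = self.preprocess(seed_text).split()
        for _ in range(max_words):
            context = tuple(words[-(self.n - 1):])
            candidates = self.counts.get(context)
            if not candidates:
                break  # unseen context: stop generating
            # sample the next word in proportion to its observed frequency
            next_word = random.choices(list(candidates),
                                       weights=list(candidates.values()))[0]
            words.append(next_word)
        return " ".join(words)
```

With this design, a context that never appeared in the corpus simply ends generation, which matches the abrupt endings visible in the sample outputs below.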

How to Run

  1. Ensure you have Python installed.
  2. Navigate to the project root directory.
  3. Run the main script:
python src/main.py

Sample Outputs

Here are some example outputs generated by the model (n=5):

Sample 1

  • Input: "the day was very"
  • Output: "the day was very asked."

Sample 2

  • Input: "it was a cold"
  • Output: "it was a cold which you for then.he me.rder room brought shrank was"

Sample 3

  • Input: "holmes looked at me"
  • Output: "holmes looked at me with a singular expression on his face."

Note: The model uses a probabilistic approach, so outputs may vary each time the code is run.
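The variation comes from sampling the next word from the learned frequency distribution rather than always taking the most frequent one. A small illustration with made-up counts (the numbers below are hypothetical, not taken from the trained model):

```python
import random

# Hypothetical counts for the next word after the context "looked at me with"
# (illustrative only; real counts come from the trained model).
next_word_counts = {"a": 7, "the": 2, "his": 1}

words = list(next_word_counts)
weights = list(next_word_counts.values())

# Each draw picks a word in proportion to its count, so "a" is chosen
# about 70% of the time but the other words still appear occasionally.
sample = random.choices(words, weights=weights, k=5)
print(sample)
```

Running this repeatedly gives different sequences, which is exactly why the sample outputs above are not reproduced verbatim on every run.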

About

A statistical 5-gram language model implemented in Python, trained on Sherlock Holmes stories by Arthur Conan Doyle (Project Gutenberg) to generate text in the author’s writing style.
