
Sherlock Holmes 5-gram Language Model

This repository contains an implementation of a 5-gram language model trained on the works of Arthur Conan Doyle (specifically, a collection of Sherlock Holmes stories).

This project was built as an assignment to demonstrate N-gram language modeling and text generation in the style of a chosen author.

Author Choice: Arthur Conan Doyle

I have chosen Arthur Conan Doyle as the author for this language model.

Why Arthur Conan Doyle?

  1. Personal Favorite: Arthur Conan Doyle is my favorite author. I admire the way he constructs narratives that are both intellectually stimulating and deeply atmospheric.
  2. Engaging Style: The Sherlock Holmes stories are renowned for their suspense, intricate plotting, and the unique voice of Dr. Watson as the narrator. This makes the generated text interesting to analyze.
  3. Rich Vocabulary: Doyle's writing features a sophisticated and distinct vocabulary ("elementary", "deduction", "singular", "features"), which provides excellent training data for an N-gram model to capture stylistic nuances.

Codebase Overview

The project is structured as follows:

  • src/: Contains the source code.
    • model.py: Defines the NGramLanguageModel class.
      • preprocess(text): Cleans the input text (converts to lowercase, removes non-alphabetic characters except periods).
      • train(text): Builds the N-gram probability model by counting the frequency of word sequences using a dictionary of dictionaries.
      • generate(seed_text, max_words): Generates new text based on a seed prompt. It uses the last $N-1$ words to predict the next word based on the learned probabilities.
    • main.py: The entry point of the application.
      • Loads the training data from data/sherlock.txt.
      • Initializes the NGramLanguageModel with $n=5$.
      • Trains the model on the corpus.
      • Generates text for specific sample prompts.
  • data/: Contains the training corpus.
    • sherlock.txt: The text file containing the Sherlock Holmes stories used for training.
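The `NGramLanguageModel` class described above can be sketched roughly as follows. This is a minimal, illustrative reconstruction based only on the README's description (dictionary-of-dictionaries counts, lowercase preprocessing that keeps periods, generation from the last $N-1$ words); the actual `src/model.py` may differ in details.

```python
import random
import re
from collections import defaultdict

class NGramLanguageModel:
    """Sketch of the model described above (interface assumed from the README)."""

    def __init__(self, n=5):
        self.n = n
        # dictionary of dictionaries: (n-1)-word context -> {next_word: count}
        self.counts = defaultdict(lambda: defaultdict(int))

    def preprocess(self, text):
        # lowercase; keep only letters, spaces, and periods
        return re.sub(r"[^a-z. ]+", " ", text.lower())

    def train(self, text):
        words = self.preprocess(text).split()
        for i in range(len(words) - self.n + 1):
            context = tuple(words[i:i + self.n - 1])
            self.counts[context][words[i + self.n - 1]] += 1

    def generate(self, seed_text, max_words=20):
        words = self.preprocess(seed_text).split()
        for _ in range(max_words):
            context = tuple(words[-(self.n - 1):])
            candidates = self.counts.get(context)
            if not candidates:
                break  # unseen context: stop generating
            # sample the next word in proportion to its observed frequency
            next_word = random.choices(list(candidates),
                                       weights=list(candidates.values()))[0]
            words.append(next_word)
        return " ".join(words)
```

With this design, a context that never appeared in the corpus simply ends generation, which matches the abrupt endings visible in the sample outputs below.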

How to Run

  1. Ensure you have Python installed.
  2. Navigate to the project root directory.
  3. Run the main script:
python src/main.py

Sample Outputs

Here are some example outputs generated by the model (n=5):

Sample 1

  • Input: "the day was very"
  • Output: "the day was very asked."

Sample 2

  • Input: "it was a cold"
  • Output: "it was a cold which you for then.he me.rder room brought shrank was"

Sample 3

  • Input: "holmes looked at me"
  • Output: "holmes looked at me with a singular expression on his face."

Note: The model uses a probabilistic approach, so outputs may vary each time the code is run.
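The variation comes from sampling the next word from the learned frequency distribution rather than always taking the most frequent one. A small illustration with made-up counts (the numbers below are hypothetical, not taken from the trained model):

```python
import random

# Hypothetical counts for the next word after the context "looked at me with"
# (illustrative only; real counts come from the trained model).
next_word_counts = {"a": 7, "the": 2, "his": 1}

words = list(next_word_counts)
weights = list(next_word_counts.values())

# Each draw picks a word in proportion to its count, so "a" is chosen
# about 70% of the time but the other words still appear occasionally.
sample = random.choices(words, weights=weights, k=5)
print(sample)
```

Running this repeatedly gives different sequences, which is exactly why the sample outputs above are not reproduced verbatim on every run.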

About

A statistical 5-gram language model implemented in Python, trained on Sherlock Holmes stories by Arthur Conan Doyle (Project Gutenberg) to generate text in the author’s writing style.
