Name Matching API Documentation

A Flask-based REST API for real-time name matching classification for entity resolution and transaction monitoring.

Overview

The Name Matching API provides endpoints for classifying whether two names refer to the same entity. It uses a trained LightGBM classifier with multiple string similarity features including edit distance, Jaccard similarity, TF-IDF cosine similarity, and sentence embeddings.

Prerequisites

Before running the API, ensure you have:

Trained the model by running the full pipeline:

python -m name_matching.data.generate_names --n_persons 700 --n_orgas 300
python -m name_matching.data.make_dataset --n_neg 10
python -m name_matching.models.train_model --test-size 0.2 --thresh 0.85

Model files exist at:
- models/model_lgb_name_matching.pkl
- models/name_matching_tfidf_ngrams.pkl

Installation

Install the required dependencies:

pip install flask

All other dependencies should already be installed from the main project requirements.

Running the API

Development Mode

python app.py

The API will start on http://localhost:5001

Production Mode

For production deployment, use a WSGI server like Gunicorn:

# Install gunicorn
pip install gunicorn

# Run with gunicorn
gunicorn -w 4 -b 0.0.0.0:5001 app:app

Options:

-w 4: Number of worker processes (adjust based on CPU cores)
-b 0.0.0.0:5001: Bind to all interfaces on port 5001
--timeout 120: Request timeout in seconds

API Endpoints

1. Health Check

Check if the API is running.

Endpoint: GET /health

Response:

{
  "status": "healthy",
  "service": "name-matching-api"
}

2. Model Information

Get information about the loaded model.

Endpoint: GET /info

Response:

{
  "status": "success",
  "model": {
    "type": "LightGBM Classifier",
    "model_path": "models/model_lgb_name_matching.pkl",
    "tfidf_path": "models/name_matching_tfidf_ngrams.pkl",
    "features": [
      "JACCARD_SIM",
      "COSINE_SIM",
      "RATIO",
      "SORTED_TOKEN_RATIO",
      "TOKEN_SET_RATIO",
      "PARTIAL_RATIO",
      "EMB_DISTANCE",
      "LEN_DIFF"
    ],
    "num_features": 8
  }
}

3. Single Prediction

Classify a single name pair.

Endpoint: POST /predict

Request Body:

{
  "CUST_NAME": "John Smith",
  "COUNTERPART_NAME": "J. Smith",
  "FT_NO": "FT12345",
  "threshold": 0.85
}

Parameters:

CUST_NAME (required): First name (customer name)
COUNTERPART_NAME (required): Second name (counterpart name)
FT_NO (optional): Transaction reference number for tracking
threshold (optional): Classification threshold (default: 0.85)

Response:

{
  "status": "success",
  "result": {
    "ft_no": "FT12345",
    "name_x": "John Smith",
    "name_y": "J. Smith",
    "prediction": 1,
    "match_label": "MATCH",
    "probability": 0.9234,
    "threshold": 0.85,
    "features": {
      "JACCARD_SIM": 0.5,
      "COSINE_SIM": 0.8765,
      "RATIO": 0.7143,
      "SORTED_TOKEN_RATIO": 0.7143,
      "TOKEN_SET_RATIO": 0.7143,
      "PARTIAL_RATIO": 0.8,
      "EMB_DISTANCE": 0.9123,
      "LEN_DIFF": 3.0
    }
  }
}

Response Fields:

prediction: Binary prediction (0 = no match, 1 = match)
match_label: Human-readable label ("MATCH" or "NO_MATCH")
probability: Confidence score (0-1)
features: Individual feature values used for prediction

4. Batch Prediction

Classify multiple name pairs in a single request.

Endpoint: POST /predict/batch

Request Body:

{
  "pairs": [
    {
      "CUST_NAME": "John Smith",
      "COUNTERPART_NAME": "J. Smith",
      "FT_NO": "FT001"
    },
    {
      "CUST_NAME": "Apple Inc.",
      "COUNTERPART_NAME": "Apple Corporation",
      "FT_NO": "FT002"
    }
  ],
  "threshold": 0.85
}

Parameters:

pairs (required): List of name pair objects
threshold (optional): Classification threshold for all pairs (default: 0.85)

Response:

{
  "status": "success",
  "count": 2,
  "results": [
    {
      "ft_no": "FT001",
      "name_x": "John Smith",
      "name_y": "J. Smith",
      "prediction": 1,
      "match_label": "MATCH",
      "probability": 0.9234,
      "threshold": 0.85,
      "features": { ... }
    },
    {
      "ft_no": "FT002",
      "name_x": "Apple Inc.",
      "name_y": "Apple Corporation",
      "prediction": 1,
      "match_label": "MATCH",
      "probability": 0.8876,
      "threshold": 0.85,
      "features": { ... }
    }
  ]
}

Partial Success (207 Multi-Status):

If some predictions fail, the API returns a 207 status:

{
  "status": "partial_success",
  "message": "1 out of 2 predictions failed",
  "results": [
    { ... },
    {
      "error": "Validation error",
      "message": "Both name_x and name_y must be non-empty strings",
      "ft_no": "FT002"
    }
  ]
}

Usage Examples

Python with requests

import requests

# Single prediction
response = requests.post(
    "http://localhost:5001/predict",
    json={
        "CUST_NAME": "John Smith",
        "COUNTERPART_NAME": "J. Smith",
        "FT_NO": "FT12345",
        "threshold": 0.85
    }
)
result = response.json()
print(f"Match: {result['result']['match_label']}")
print(f"Probability: {result['result']['probability']}")

# Batch prediction
response = requests.post(
    "http://localhost:5001/predict/batch",
    json={
        "pairs": [
            {"CUST_NAME": "John Doe", "COUNTERPART_NAME": "J. Doe", "FT_NO": "FT001"},
            {"CUST_NAME": "Apple Inc", "COUNTERPART_NAME": "Apple Corp", "FT_NO": "FT002"}
        ],
        "threshold": 0.85
    }
)
results = response.json()
for result in results['results']:
    print(f"{result['ft_no']}: {result['match_label']} ({result['probability']:.2%})")

cURL

# Single prediction
curl -X POST http://localhost:5001/predict \
  -H "Content-Type: application/json" \
  -d '{
    "CUST_NAME": "John Smith",
    "COUNTERPART_NAME": "J. Smith",
    "FT_NO": "FT12345",
    "threshold": 0.85
  }'

# Batch prediction
curl -X POST http://localhost:5001/predict/batch \
  -H "Content-Type: application/json" \
  -d '{
    "pairs": [
      {"CUST_NAME": "John Doe", "COUNTERPART_NAME": "J. Doe", "FT_NO": "FT001"},
      {"CUST_NAME": "Apple Inc", "COUNTERPART_NAME": "Apple Corp", "FT_NO": "FT002"}
    ],
    "threshold": 0.85
  }'

JavaScript (fetch)

// Single prediction
const response = await fetch('http://localhost:5001/predict', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    CUST_NAME: 'John Smith',
    COUNTERPART_NAME: 'J. Smith',
    FT_NO: 'FT12345',
    threshold: 0.85
  })
});
const result = await response.json();
console.log(`Match: ${result.result.match_label}`);

Error Handling

Error Response Format

{
  "status": "error",
  "message": "Error description",
  "ft_no": "FT12345"
}

Common Error Codes

400 Bad Request: Invalid input data
- Missing required fields
- Invalid data types
- Invalid threshold value
404 Not Found: Endpoint does not exist
405 Method Not Allowed: Wrong HTTP method used
500 Internal Server Error: Server-side error
- Model loading failure
- Feature generation error
- Prediction error

Testing

Run Unit Tests

Test the prediction module:

pytest tests/unit_tests/test_predict_model.py -v

Run Integration Tests

Test the API endpoints:

pytest tests/integration_tests/test_api.py -v

Run All Tests

pytest tests/ -v

Performance Considerations

Model Loading: The model is loaded once at startup and cached for subsequent requests.
Batch Processing: Use the /predict/batch endpoint for multiple predictions to reduce overhead.
Threshold Tuning: Adjust the threshold based on your use case:
- Lower threshold (e.g., 0.5): Higher recall, more false positives
- Higher threshold (e.g., 0.9): Higher precision, more false negatives
Concurrency: Use a WSGI server with multiple workers for production deployments.

Configuration

Model paths and column names are configured in name_matching/config/Config.ini. To use different model paths, update:

[MODELPATH]
MODEL_LGB_NAME_MATCHING = models/model_lgb_name_matching.pkl
FILENAME_MODEL_TFIDF_NGRAM = models/name_matching_tfidf_ngrams.pkl

Monitoring and Logging

The API uses structlog for structured logging. All predictions and errors are logged with relevant metadata:

Request information (method, path, remote address)
Prediction requests (names, transaction reference)
Prediction results (match label, probability)
Errors (validation, type, prediction errors)

Troubleshooting

Model Not Found Error

FileNotFoundError: Model file not found

Solution: Train the model first using:

python -m name_matching.models.train_model

Empty Prediction Response

Solution: Check that both names are non-empty strings and properly formatted in the request.

Low Prediction Accuracy

Solution:

Retrain the model with more training data
Adjust the classification threshold
Check that input names are properly normalized

Support

For issues or questions:

Check the main project documentation in CLAUDE.md
Review test cases in tests/ for usage examples
Ensure all model files exist and are properly trained

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Name Matching API Documentation

Overview

Prerequisites

Installation

Running the API

Development Mode

Production Mode

API Endpoints

1. Health Check

2. Model Information

3. Single Prediction

4. Batch Prediction

Usage Examples

Python with requests

cURL

JavaScript (fetch)

Error Handling

Error Response Format

Common Error Codes

Testing

Run Unit Tests

Run Integration Tests

Run All Tests

Performance Considerations

Configuration

Monitoring and Logging

Troubleshooting

Model Not Found Error

Empty Prediction Response

Low Prediction Accuracy

Support

FilesExpand file tree

API_README.md

Latest commit

History

API_README.md

File metadata and controls

Name Matching API Documentation

Overview

Prerequisites

Installation

Running the API

Development Mode

Production Mode

API Endpoints

1. Health Check

2. Model Information

3. Single Prediction

4. Batch Prediction

Usage Examples

Python with requests

cURL

JavaScript (fetch)

Error Handling

Error Response Format

Common Error Codes

Testing

Run Unit Tests

Run Integration Tests

Run All Tests

Performance Considerations

Configuration

Monitoring and Logging

Troubleshooting

Model Not Found Error

Empty Prediction Response

Low Prediction Accuracy

Support