Practical NLP Techniques: Word Embeddings, Transformers, and Chatbots
Understanding Word Embeddings and Semantic Similarity
Word embeddings are numerical representations of words that capture their semantic meaning and relationships. These techniques are fundamental in Natural Language Processing (NLP), enabling machines to understand and process human language more effectively. The following examples demonstrate how to work with word embeddings using Gensim and custom implementations.
Gensim Word Vectors: Semantic Relationships
Gensim is a popular Python library for topic modeling and word embedding. Here, we load pre-trained GloVe word vectors and use them to find words that complete analogies, showcasing how word vectors capture semantic relationships like “king – man + woman = queen”.
from gensim.downloader import load
# Load pre-trained GloVe word vectors (glove-wiki-gigaword-100)
# This model contains 100-dimensional vectors trained on Wikipedia and Gigaword data.
word_vectors = load('glove-wiki-gigaword-100')
# Example 1: Animal relationship (kitten → cat, puppy → dog)
# This operation attempts to find a word that relates to 'dog' in the same way 'kitten' relates to 'cat'.
# Conceptually: vector('kitten') - vector('cat') + vector('dog')
result = word_vectors.most_similar(positive=['kitten', 'dog'], negative=['cat'], topn=1)
print(f"Result of 'kitten - cat + dog' is: {result[0][0]}")
# Example 2: Fruit relationship (orange → fruit, mango → tropical fruit)
# This operation attempts to find a word that relates to 'tropical' in the same way 'orange' relates to 'fruit'.
# Conceptually: vector('orange') - vector('fruit') + vector('tropical')
# The hoped-for answer is a tropical fruit such as 'mango', but the top result depends on the vectors and is not guaranteed.
result = word_vectors.most_similar(positive=['orange', 'tropical'], negative=['fruit'], topn=1)
print(f"Result of 'orange - fruit + tropical' is: {result[0][0]}")
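These examples show that directions in the embedding space encode relationships between words, which is what makes tasks like analogy completion possible. The results are approximate: the nearest word to the computed vector is not always the one a human would pick.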
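To make the arithmetic behind most_similar concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are hand-made for illustration (they are not GloVe values), and the helper names cosine and analogy are hypothetical. Gensim's own implementation differs in detail; for example, it works with unit-normalized vectors.

```python
import math

# Hand-made toy vectors (dimensions loosely: royalty, male, female).
# These are illustrative assumptions, not values from any trained model.
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(positive, negative, vectors):
    """Add the positive vectors, subtract the negative ones, and return
    the closest remaining word by cosine similarity."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    # Exclude the query words themselves, as most_similar does.
    excluded = set(positive) | set(negative)
    candidates = [w for w in vectors if w not in excluded]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

# king - man + woman
print(analogy(positive=["king", "woman"], negative=["man"], vectors=toy_vectors))
```

With these toy values the nearest remaining word is "queen". Real embeddings have far more dimensions and no human-readable axes, so the same operation is much noisier.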
Visualizing Word Embeddings with t-SNE
Visualizing high-dimensional word embeddings can provide insights into their structure. t-Distributed Stochastic Neighbor Embedding (t-SNE) is a dimensionality reduction technique particularly well-suited for visualizing high-dimensional datasets in 2D or 3D, preserving local structures.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from gensim.downloader import load
import numpy as np
# Load pre-trained GloVe word vectors
word_vectors = load('glove-wiki-gigaword-100')
# Define a list of technology-related words
tech_words = ['computer', 'internet', 'software', 'hardware', 'network', 'data', 'cloud', 'robot', 'algorithm', 'technology']
# Filter words to ensure they exist in the loaded vocabulary
tech_words = [word for word in tech_words if word in word_vectors.key_to_index]
# Extract vectors for the selected words
vectors = np.array([word_vectors[word] for word in tech_words])
# Apply t-SNE to reduce dimensions to 2D
# perplexity relates to the number of nearest neighbors used in the algorithm;
# scikit-learn requires it to be smaller than the number of samples (at most 10 words here).
tsne = TSNE(n_components=2, random_state=42, perplexity=5)
reduced_vectors = tsne.fit_transform(vectors)
# Plot the reduced vectors
plt.figure(figsize=(10, 6))
for i, word in enumerate(tech_words):
    plt.scatter(reduced_vectors[i, 0], reduced_vectors[i, 1], label=word)
    # Annotate points with word labels
    plt.text(reduced_vectors[i, 0] + 0.02, reduced_vectors[i, 1] + 0.02, word, fontsize=9)
plt.title("t-SNE Visualization of Technology Words")
plt.xlabel("Dimension 1")
plt.ylabel("Dimension 2")
plt.legend()
# plt.show() # Uncomment to display the plot in an interactive environment
# Find and print words similar to 'computer'
input_word = 'computer'
if input_word in word_vectors.key_to_index:
    similar_words = word_vectors.most_similar(input_word, topn=5)
    print(f"\n5 words similar to '{input_word}':")
    for word, similarity in similar_words:
        print(f"{word} (similarity: {similarity:.2f})")
else:
    print(f"'{input_word}' is not in the vocabulary.")
The t-SNE plot helps visualize clusters of semantically related words, while the most_similar function quantifies these relationships by listing the closest words in the embedding space.
Training Custom Word2Vec Models
While pre-trained models are useful, you might need to train your own Word2Vec model on a specific corpus to capture domain-specific nuances. This example demonstrates training a simple Word2Vec model on medical data.
from gensim.models import Word2Vec
# Sample medical data (list of sentences/documents)
medical_data = [
    ["patient", "doctor", "nurse", "hospital", "treatment"],
    ["cancer", "chemotherapy", "radiation", "surgery", "recovery"],
    ["infection", "antibiotics", "diagnosis", "disease", "virus"],
    ["heart", "disease", "surgery", "cardiology", "recovery"]
]
# Train a Word2Vec model
# sentences: The corpus to train on.
# vector_size: Dimensionality of the word vectors.
# window: Maximum distance between the current and predicted word within a sentence.
# min_count: Ignores all words with total frequency lower than this.
# workers: Use these many worker threads to train the model.
# epochs: Number of iterations (epochs) over the corpus.
model = Word2Vec(sentences=medical_data, vector_size=100, window=2,
                 min_count=1, workers=1, epochs=50)
# Find and print words similar to 'patient'
input_word = "patient"
if input_word in model.wv.key_to_index:  # model.wv contains the word vectors
    similar_words = model.wv.most_similar(input_word, topn=3)
    print(f"3 words similar to '{input_word}':")
    for word, similarity in similar_words:
        print(f"{word} (similarity: {similarity:.2f})")
else:
    print(f"'{input_word}' is not in the vocabulary.")
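Training a custom Word2Vec model allows you to create embeddings tailored to your specific dataset, which can be crucial for specialized NLP tasks. Keep in mind that the four short sentences used here only demonstrate the API: with so little data the printed similarities are essentially noise, and a useful model needs a much larger corpus.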
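The window parameter is easier to picture with a small sketch of how context pairs could be enumerated. The helper skipgram_pairs below is hypothetical (it is not part of Gensim) and shows the idea in skip-gram style for simplicity. Real Word2Vec training adds details that are omitted here, such as negative sampling and subsampling of frequent words.

```python
def skipgram_pairs(sentence, window=2):
    """Return (center, context) pairs for every word and each neighbor
    at most `window` positions away in the same sentence."""
    pairs = []
    for i, center in enumerate(sentence):
        start = max(0, i - window)
        end = min(len(sentence), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, sentence[j]))
    return pairs

sentence = ["patient", "doctor", "nurse", "hospital", "treatment"]
pairs = skipgram_pairs(sentence, window=2)

# Context words seen for the first word in the sentence
print([context for center, context in pairs if center == "patient"])
print(f"Total pairs: {len(pairs)}")
```

With window=2, "patient" is paired with "doctor" and "nurse" but never with "hospital", which sits three positions away. A larger window tends to capture broader topical similarity, and a smaller one more local, syntactic similarity.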
Custom Word Embeddings for Prompt Enrichment
Beyond pre-trained models, you can define custom word relationships for specific applications, such as enriching prompts for generative AI models. Strictly speaking, the dictionary below is a hand-written lookup table of related terms rather than a set of learned embeddings, but it plays a similar role: it expands a prompt with related terms.
# A dictionary representing custom word embeddings or synonyms
word_embeddings = {
    "ai": ["machine learning", "deep learning", "data science"],
    "data": ["information", "dataset", "analytics"],
    "science": ["research", "experiment", "technology"],
    "learning": ["education", "training", "knowledge"],
    "robot": ["automation", "machine", "mechanism"]
}
def find_similar_words(word):
    """
    Finds similar words based on the custom word_embeddings dictionary.
    """
    return word_embeddings.get(word, [])
def enrich_prompt(prompt):
    """
    Enriches a given prompt by adding similar words in parentheses.
    Example: "ai data" -> "ai (machine learning, deep learning, data science) data (information, dataset, analytics)"
    """
    words = prompt.lower().split()
    enriched_words = []
    for word in words:
        similar_words = find_similar_words(word)
        if similar_words:
            enriched_words.append(f"{word} ({', '.join(similar_words)})")
        else:
            enriched_words.append(word)
    return " ".join(enriched_words)
# Example usage:
sample_prompt = "Explore ai and data science"
enriched_output = enrich_prompt(sample_prompt)
print(f"Original Prompt: '{sample_prompt}'")
print(f"Enriched Prompt: '{enriched_output}'")
This method provides a straightforward way to inject more context or related keywords into a prompt, potentially leading to more comprehensive or relevant outputs from language models.
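One limitation of splitting on whitespace is that a word followed by punctuation (for example "ai,") no longer matches its dictionary key, and lowercasing the whole prompt discards the original capitalization. The variation below is a sketch of one way around both issues using a regular expression. The function name enrich_prompt_keep_punctuation and the shortened related_terms dictionary are assumptions made for this illustration.

```python
import re

# Shortened, hypothetical lookup table for the demonstration
related_terms = {
    "ai": ["machine learning", "deep learning"],
    "data": ["information", "dataset"],
}

def enrich_prompt_keep_punctuation(prompt):
    """Append related terms after each known word while leaving
    punctuation and the original capitalization untouched."""
    def expand(match):
        word = match.group(0)
        related = related_terms.get(word.lower())
        if not related:
            return word
        return f"{word} ({', '.join(related)})"
    # Only runs of letters are treated as words; everything else is kept as-is.
    return re.sub(r"[A-Za-z]+", expand, prompt)

print(enrich_prompt_keep_punctuation("Explore AI, then data."))
```

Because re.sub only rewrites the matched words, commas, periods, and spacing in the prompt pass through unchanged.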
Generating Creative Text with Word Embeddings
Custom word embeddings can also be used in creative text generation, where a “seed” word can trigger the inclusion of related terms to build a narrative. This example generates a simple paragraph based on a chosen seed word and its associated terms.
# A dictionary of seed words and their associated similar words
word_embeddings = {
    "adventure": ["journey", "exploration", "quest"],
    "robot": ["machine", "automation", "mechanism"],
    "forest": ["woods", "jungle", "wilderness"],
    "ocean": ["sea", "waves", "depths"],
    "magic": ["spell", "wizardry", "enchantment"]
}
def get_similar_words(seed_word):
    """
    Retrieves similar words for a given seed word from the custom embeddings.
    """
    return word_embeddings.get(seed_word, ["No similar words found"])
def create_paragraph(seed_word):
    """
    Generates a short paragraph using the seed word and its similar words.
    """
    similar_words = get_similar_words(seed_word)
    if "No similar words found" in similar_words:
        return f"Sorry, I couldn't find similar words for '{seed_word}'."
    paragraph = (
        f"Once upon a time, there was a great {seed_word}. "
        f"It was full of {', '.join(similar_words[:-1])}, and {similar_words[-1]}. "
        f"Everyone who experienced this {seed_word} always remembered it as a remarkable tale."
    )
    return paragraph
# Example usage:
seed_word = "adventure"
story = create_paragraph(seed_word)
print("Generated Paragraph:")
print(story)
seed_word_2 = "robot"
story_2 = create_paragraph(seed_word_2)
print("\nGenerated Paragraph (Robot):")
print(story_2)
This demonstrates a basic approach to procedural text generation, where predefined semantic relationships contribute to the narrative’s richness.
Practical Applications of NLP with Modern Libraries
Modern NLP libraries like Hugging Face Transformers and LangChain provide powerful tools for implementing complex NLP tasks with minimal code. These examples showcase sentiment analysis and text summarization.
Sentiment Analysis with Hugging Face Transformers
Sentiment analysis is the process of determining the emotional tone behind a piece of text. The Hugging Face pipeline function offers a straightforward way to perform this task using pre-trained models.
from transformers import pipeline
# Initialize the sentiment analysis pipeline
# This downloads a default pre-trained model for sentiment analysis.
sentiment_analyzer = pipeline("sentiment-analysis")
# List of sentences to analyze
sentences = [
    "I love using this product! It makes my life so much easier.",
    "The service was terrible, and I'm very disappointed.",
    "It's an average experience, nothing special but not bad either."
]
# Perform sentiment analysis for each sentence
for sentence in sentences:
    result = sentiment_analyzer(sentence)[0]  # The pipeline returns a list of dictionaries
    print(f"Sentence: {sentence}")
    print(f"Sentiment: {result['label']} (Score: {result['score']:.2f})\n")
The Transformers library simplifies the application of state-of-the-art models for tasks like sentiment analysis, making advanced NLP accessible to developers.
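The third sentence above is deliberately neutral, but a binary sentiment model that only knows positive and negative labels will still assign it one of the two. One common workaround is to treat low-confidence predictions as neutral. The sketch below works on dictionaries shaped like the pipeline output (with label and score keys). The helper name with_neutral, the 0.75 threshold, and the sample results are all assumptions made for illustration, not real model output.

```python
def with_neutral(result, threshold=0.75):
    """Return the predicted label, or 'NEUTRAL' when the model's
    confidence falls below the (arbitrary) threshold."""
    if result["score"] < threshold:
        return "NEUTRAL"
    return result["label"]

# Made-up results in the same shape the pipeline returns
sample_results = [
    {"label": "POSITIVE", "score": 0.99},
    {"label": "NEGATIVE", "score": 0.98},
    {"label": "POSITIVE", "score": 0.61},
]

for result in sample_results:
    print(f"{result['label']} ({result['score']:.2f}) -> {with_neutral(result)}")
```

A threshold like this should be tuned on labeled examples from your own domain. If neutral text matters to the application, a model trained with an explicit neutral class is usually the more reliable option.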
Text Summarization with Hugging Face Transformers
Text summarization condenses a longer text into a shorter version while retaining the main points. The Hugging Face summarization pipeline can be used for this purpose.
from transformers import pipeline
# Initialize the summarization pipeline
summarizer = pipeline("summarization")
# Long text to be summarized
long_text = """
Artificial Intelligence (AI) is transforming various industries by automating tasks, improving
efficiency, and enabling new capabilities. In the healthcare sector, AI is used for disease diagnosis,
personalized medicine, and drug discovery. In the business world, AI-powered systems are optimizing customer
service, fraud detection, and supply chain management. AI's impact on everyday life is significant, from smart
assistants to recommendation systems in streaming platforms. As AI continues to evolve, it promises even greater
advancements in fields like education, transportation, and environmental sustainability.
"""
# Generate a summary
# max_length: Maximum length of the generated summary, in tokens.
# min_length: Minimum length of the generated summary, in tokens.
# do_sample: Whether to use sampling; False gives deterministic decoding
#            (greedy or beam search, depending on the model's configuration).
summary = summarizer(long_text, max_length=50, min_length=20,
                     do_sample=False)[0]["summary_text"]
print("Summarized Text:")
print(summary)
This example demonstrates how easily complex tasks like summarization can be performed using pre-trained models from the Hugging Face ecosystem.
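Summarization models accept only a limited number of input tokens, so documents longer than the short passage above usually have to be split before summarizing. The sketch below shows one simple approach: chunking by word count, which is only a rough proxy for the model's token count. The helper name chunk_words and the default of 300 words are assumptions for illustration.

```python
def chunk_words(text, max_words=300):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]

document = "one two three four five six seven"
for index, chunk in enumerate(chunk_words(document, max_words=3), start=1):
    print(f"Chunk {index}: {chunk}")
```

Each chunk could then be passed to the summarizer separately and the partial summaries joined, or summarized once more. Splitting on sentence or paragraph boundaries instead of raw word counts generally gives the model more coherent input.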
Advanced Text Summarization with LangChain and Cohere
LangChain is a framework designed to simplify the creation of applications powered by large language models (LLMs). This example shows how to use LangChain with a Cohere LLM to summarize text from a file.
from langchain_community.llms import Cohere
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
# from google.colab import drive # Uncomment if running in Google Colab and need to mount drive
# Path to the text file to be summarized
# Ensure 'crow.txt' exists in the specified path with content.
file_path = "/content/crow.txt"
# Example content for crow.txt if not available:
# text = """
# The crow, a common bird found across the globe, is known for its intelligence and adaptability.
# Crows are highly social animals, often living in large family groups. They exhibit complex problem-solving
# abilities, including the use of tools and understanding of cause and effect. Their diet is omnivorous,
# consuming everything from insects and seeds to carrion and human scraps. Crows also possess remarkable
# vocalizations, capable of mimicking various sounds and communicating through a wide range of calls.
# """
try:
    with open(file_path, "r") as file:
        text = file.read()
except FileNotFoundError:
    text = "The crow is a highly intelligent bird known for its problem-solving abilities and complex social structures. They are omnivorous and can mimic sounds."
    print(f"'{file_path}' not found. Using placeholder text for demonstration.")
# IMPORTANT: Replace with your actual Cohere API key.
# For production, use environment variables or a secure secret management system.
cohere_api_key = "YOUR_COHERE_API_KEY" # Placeholder for security
# Define a prompt template for summarization
prompt_template = PromptTemplate(
    input_variables=["text"],
    template="""
Summarize the following text in two bullet points:
{text}
"""
)
# Initialize the Cohere LLM
llm = Cohere(cohere_api_key=cohere_api_key)
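# Create an LLMChain to combine the prompt and the LLM
# Note: LLMChain and .run() are deprecated in recent LangChain releases;
# check the current LangChain documentation for the recommended replacement.
chain = LLMChain(llm=llm, prompt=prompt_template)
# Run the chain with the text
result = chain.run(text)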
print("Original Text:\n")
print(text)
print("\nSummarized Output in Bullet Points:\n")
print(result)
LangChain streamlines the integration of LLMs into applications, allowing for more complex workflows like reading from files and applying specific prompt formats for summarization.
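The script above notes that production code should read the API key from an environment variable rather than hardcoding it. A minimal sketch of that approach follows. The variable name COHERE_API_KEY and the helper load_api_key are assumptions chosen for this example.

```python
import os

def load_api_key(env_var="COHERE_API_KEY"):
    """Return the key stored in env_var, or raise if it is missing or empty."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {env_var} is not set.")
    return key

try:
    cohere_api_key = load_api_key()
    print("API key loaded from the environment.")
except RuntimeError as error:
    # Fail early with a clear message instead of sending an empty key
    print(error)
```

Failing fast when the variable is missing keeps a placeholder string from ever being sent to the API, and keeps the real key out of source control.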
Building a Basic Question-Answering Chatbot
A simple chatbot can be implemented using a predefined set of question-answer pairs. This example demonstrates a basic interactive chatbot that provides information based on user input.
Simple IPC Chatbot Implementation
This Python script creates a rudimentary chatbot that answers questions about specific sections of the Indian Penal Code (IPC) based on a static dictionary of Q&A pairs.
# Predefined question-answer pairs for the IPC chatbot
qa_pairs = {
"1": ("What is Section 302 of IPC?", "Punishment for murder: death or life imprisonment."),
"2": ("What does Section 420 say?", "Cheating and dishonestly inducing delivery of property. Up to 7 years + fine."),
"3": ("Explain Section 375.", "Defines rape and the conditions under which it is considered rape.")
}
print("IPC Chatbot — Type a number (1-3) to get information or 'exit' to quit:")
while True:
    # Display available questions
    for k, v in qa_pairs.items():
        print(f"{k}. {v[0]}")
    choice = input("\nYour choice: ").strip()
    if choice.lower() == "exit":
        print("Goodbye!")
        break
    else:
        # Retrieve and print the answer, or an "Invalid choice" message
        # The .get() method allows providing a default value if the key is not found.
        # We access the second element of the tuple, which is the answer.
        print(f"\n{qa_pairs.get(choice, ('', 'Invalid choice. Please enter 1, 2, 3, or exit.'))[1]}\n")
This basic chatbot illustrates the concept of rule-based question answering, a foundational element in conversational AI systems.
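A small step beyond menu numbers is to let the user type a free-form question and pick out the section number with a rule. The sketch below is one hypothetical way to do that with a regular expression. The answer function and the answers dictionary are assumptions for illustration, reusing the answer texts from the script above.

```python
import re

# Answers keyed by IPC section number (texts taken from the Q&A pairs above)
answers = {
    "302": "Punishment for murder: death or life imprisonment.",
    "420": "Cheating and dishonestly inducing delivery of property. Up to 7 years + fine.",
    "375": "Defines rape and the conditions under which it is considered rape.",
}

def answer(question):
    """Look for a known section number in the question and return its answer."""
    for number in re.findall(r"\d+", question):
        if number in answers:
            return answers[number]
    known = ", ".join(sorted(answers))
    return f"Sorry, I only know these sections: {known}."

print(answer("What is Section 302 of IPC?"))
print(answer("Tell me about section 999"))
```

This is still rule-based: it recognizes only the section numbers listed in the dictionary and ignores the rest of the question.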