Key Techniques in Web Scraping and Text Processing
Web Scraping and Data Extraction
Core Web Technologies
A web page is built using several core technologies:
- HTML (HyperText Markup Language): Defines the content and structure of a web page. It is composed of tags organized in a tree-like structure.
- CSS (Cascading Style Sheets): Controls the design and presentation of a web page.
- JavaScript: Enables interactive actions and dynamic content on a web page.
Scraping Methods
Static Web Page Scraping
For static pages, you can use libraries like Beautiful Soup in Python. This tool parses HTML and allows you to find content based on tags and attributes.
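For illustration, a minimal sketch using requests and Beautiful Soup; the URL and the headline class are placeholders, not a real site:

```python
import requests
from bs4 import BeautifulSoup

# Fetch the raw HTML of a (hypothetical) static page.
resp = requests.get("https://example.com/news")
soup = BeautifulSoup(resp.text, "html.parser")

# Find content based on tags and attributes, e.g. every <h2 class="headline">.
for h2 in soup.find_all("h2", class_="headline"):
    print(h2.get_text(strip=True))
```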
Dynamic Web Page Scraping
For dynamic pages that load content using JavaScript, a simple library like Beautiful Soup is not enough. You need to use a tool like Selenium, which can automate a web browser and interact with the page (e.g., clicking buttons, scrolling) to load all content before scraping.
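A rough Selenium sketch, assuming a Chrome driver is available; the URL and the button selector are invented for illustration:

```python
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/dynamic")

# Interact with the page so that JavaScript-rendered content is loaded.
driver.find_element(By.CSS_SELECTOR, "button.load-more").click()
time.sleep(2)  # crude wait; explicit waits (WebDriverWait) are more robust

html = driver.page_source  # now includes the dynamically loaded content
driver.quit()
```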
Using APIs
An API (Application Programming Interface) acts like a waiter in a restaurant: it receives a request from a client (you), delivers it to the server, and returns the server's response. When one is available, it is better to use an officially provided API for data extraction than to scrape pages directly.
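A hedged sketch of calling a JSON API with requests; the endpoint and parameters are made up for illustration:

```python
import requests

resp = requests.get(
    "https://api.example.com/v1/articles",      # hypothetical endpoint
    params={"q": "web scraping", "page": 1},    # the "request" handed to the waiter
    timeout=10,
)
resp.raise_for_status()
data = resp.json()  # the server's structured response
```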
Web Scraping Considerations
Be cautious when scraping. Always check whether you have permission to scrape a website (e.g., by reading its robots.txt file). To avoid overloading the server, it is crucial to control your request rate, for instance by calling time.sleep() between requests.
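One possible way to combine both checks, using the standard library's robots.txt parser (the site and URLs are placeholders):

```python
import time
from urllib.robotparser import RobotFileParser

import requests

# Check what the site allows before scraping.
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

urls = ["https://example.com/page1", "https://example.com/page2"]
for url in urls:
    if rp.can_fetch("*", url):   # are we permitted to fetch this path?
        requests.get(url)
        time.sleep(1)            # pause between requests to avoid overloading the server
```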
Regular Expressions for Pattern Matching
A Regular Expression (Regex) is a sequence of characters that specifies a search pattern in text. It is used to find text that matches a specific pattern, employing special characters called metacharacters.
Key Concepts
- Metacharacters: Special characters with a specific meaning in regex (e.g., *, +, ?, ^, $).
- Escaping: If you need to match a character that is also a metacharacter, put a backslash (\) in front of it to escape its special meaning.
Common Functions in Python’s `re` Module
- match(): Checks for a match only at the beginning of the string.
- search(): Scans through the string, looking for the first location where the pattern produces a match.
- findall(): Finds all substrings where the pattern matches and returns them as a list.
- finditer(): Similar to findall(), but returns an iterator yielding match objects.
- sub(): Replaces one or many matches with a string.
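A quick sketch of these functions on a toy string:

```python
import re

text = "cat, catalog, concatenate"

re.match(r"cat", text)                          # match at the very start of the string
re.search(r"cat", "a catalog")                  # first occurrence anywhere in the string
re.findall(r"cat", text)                        # ['cat', 'cat', 'cat']
[m.span() for m in re.finditer(r"cat", text)]   # start/end positions of each match
re.sub(r"cat", "dog", text)                     # 'dog, dogalog, condogenate'
```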
Creating and Using Patterns
You can pre-compile a pattern for efficiency using re.compile(). For example, pattern = re.compile("ab", re.M), where the re.M (or re.MULTILINE) flag allows matching at the beginning of each line rather than only at the start of the whole string. Common anchors include ^ for the start of a string and $ for the end.
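A small sketch of re.compile() and the effect of re.M on the ^ anchor:

```python
import re

text = "abc\nabd\nxyz"

pattern = re.compile(r"^ab", re.M)   # re.M / re.MULTILINE: ^ matches at each line start
pattern.findall(text)                # ['ab', 'ab']

re.findall(r"^ab", text)             # ['ab'] -- without re.M, ^ matches only the string start
```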
Greedy Matching
By default, regex quantifiers are greedy, meaning they try to match the longest possible string. For example, in the string AB12efCefGH:
- The pattern [a-zA-Z0-9]ef[a-zA-Z0-9] matches only 2efC. Although CefG also fits the pattern, its C has already been consumed by the first match, so the engine continues scanning after it and finds nothing more.
- The pattern [a-zA-Z0-9]+ef[a-zA-Z0-9] matches AB12efCefG, because the + quantifier (one or more) is greedy and extends the initial part of the match as far as possible.
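The behaviour can be checked directly:

```python
import re

s = "AB12efCefGH"

re.findall(r"[a-zA-Z0-9]ef[a-zA-Z0-9]", s)    # ['2efC'] -- C is consumed, so CefG cannot match
re.findall(r"[a-zA-Z0-9]+ef[a-zA-Z0-9]", s)   # ['AB12efCefG'] -- greedy + stretches the prefix
```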
Lookarounds
- Positive Lookahead (?=...): Asserts that the subpattern inside the parentheses must match after the current position, without consuming any characters. For example, .(?=efg) matches a single character that is immediately followed by "efg".
- Positive Lookbehind (?<=...): Asserts that the subpattern inside the parentheses must match before the current position, without consuming any characters. For example, (?<=abc). matches a single character that is immediately preceded by "abc".
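A short demonstration of both lookarounds:

```python
import re

re.findall(r".(?=efg)", "xefg yefg")    # ['x', 'y'] -- characters followed by "efg"
re.findall(r"(?<=abc).", "abc1 abc2")   # ['1', '2'] -- characters preceded by "abc"
```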
Text Preprocessing
Text preprocessing involves cleaning and preparing text data for analysis. Key steps include:
- Tokenization: Breaking down a text corpus into smaller units like sentences or words (tokens).
- Part-of-Speech (POS) Tagging: Assigning a grammatical category (e.g., noun, verb) to each token.
- Normalization: Standardizing words, such as converting different forms with the same meaning into a single canonical form.
- Cleaning: Removing noise from the text, such as HTML tags or special characters.
- Stopword Removal: Removing common words (e.g., “the”, “is”, “a”) that provide little semantic value.
- Lemmatization: Reducing words to their base or dictionary form (lemma), considering the word’s context and part of speech.
- Stemming: A simpler, rule-based process of reducing words to their root form by chopping off endings.
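A combined sketch of these steps using NLTK, assuming the required data packages (punkt, averaged_perceptron_tagger, stopwords, wordnet) have already been downloaded with nltk.download():

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

raw = "<p>The striped bats are hanging on their feet!</p>"

text = re.sub(r"<[^>]+>", " ", raw).lower()          # cleaning: strip HTML tags, lowercase
tokens = nltk.word_tokenize(text)                    # tokenization
tagged = nltk.pos_tag(tokens)                        # POS tagging: [('the', 'DT'), ...]
tokens = [t for t in tokens if t.isalpha()]          # drop punctuation tokens
tokens = [t for t in tokens if t not in stopwords.words("english")]  # stopword removal

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
print([lemmatizer.lemmatize(t) for t in tokens])     # lemmatization (dictionary forms)
print([stemmer.stem(t) for t in tokens])             # stemming (chopped-off endings)
```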
Text Representation
This is the process of converting text into numerical vectors that machine learning models can understand.
Word-Level Representation
- One-Hot Encoding: Represents words as sparse vectors where one element is 1 and all others are 0. Does not capture semantic relationships.
- Word Embedding: Dense vector representations learned from data that capture semantic relationships between words. Common models include Word2vec, FastText, and GloVe.
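A tiny one-hot sketch with NumPy, showing why it carries no similarity information (the vocabulary is made up):

```python
import numpy as np

vocab = ["cat", "dog", "apple"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1           # exactly one position is "hot"
    return vec

one_hot("dog")                           # array([0., 1., 0.])
np.dot(one_hot("cat"), one_hot("dog"))   # 0.0 for every distinct pair of words
# Learned embeddings (Word2vec, FastText, GloVe) instead place related words
# near each other in a dense, low-dimensional space.
```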
Document-Level Representation
- Bag-of-Words (BoW): Represents a document by counting the occurrences of each word, disregarding grammar and word order.
- Document-Term Matrix (DTM): A matrix where rows represent documents and columns represent words, with each cell containing the word count. This often results in a large, sparse matrix.
- TF-IDF (Term Frequency-Inverse Document Frequency): A numerical statistic that reflects how important a word is to a document in a collection or corpus.
- Document Embeddings: Models like Doc2vec or Sent2vec create a single vector for an entire document or sentence.
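A scikit-learn sketch of a document-term matrix and its TF-IDF counterpart (toy documents):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs"]

dtm = CountVectorizer().fit_transform(docs)      # Bag-of-Words counts (sparse DTM)
tfidf = TfidfVectorizer().fit_transform(docs)    # TF-IDF-weighted version of the same matrix

print(dtm.shape, tfidf.shape)   # rows = documents, columns = vocabulary terms
```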
Handling Vocabulary
- Integer Encoding: After tokenization, each unique word in the vocabulary is assigned a unique integer.
- Out-of-Vocabulary (OOV): Words that appear during testing but were not in the training vocabulary are marked as unknown (UNK).
- Padding: A technique to make all sequences in a batch have the same length by adding a special token (e.g., 0) to shorter sequences.
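A hand-rolled sketch of integer encoding, OOV handling, and padding (indices 0 and 1 are reserved here, by convention, for padding and unknown words):

```python
train_tokens = [["the", "cat", "sat"], ["the", "dog", "ran", "away"]]

vocab = {"<PAD>": 0, "<UNK>": 1}
for sent in train_tokens:
    for w in sent:
        vocab.setdefault(w, len(vocab))           # integer encoding of the training vocabulary

def encode(sent, max_len):
    ids = [vocab.get(w, vocab["<UNK>"]) for w in sent]    # OOV words map to <UNK>
    return ids + [vocab["<PAD>"]] * (max_len - len(ids))  # pad to a common length

encode(["the", "cat", "ran"], max_len=5)      # all words known, two padding tokens
encode(["the", "fox", "jumped"], max_len=5)   # "fox" and "jumped" become <UNK>
```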
TF-IDF Explained
TF-IDF assigns a weight to each word based on its importance. The formula is TF-IDF = TF * IDF.
- Term Frequency (TF): tf(d, t), the frequency of term t in document d.
- Inverse Document Frequency (IDF): Measures how much information the word provides; it is the logarithm of the total number of documents divided by the number of documents containing the word: idf(t) = log(N / (df(t) + 1)), where N is the total number of documents and df(t) is the number of documents containing term t. The logarithm dampens the effect of high document frequencies, and the +1 in the denominator avoids division by zero for unseen terms.
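Working the formula by hand on a toy corpus (the base-10 logarithm is chosen arbitrarily):

```python
import math

docs = [["cat", "sat", "cat"], ["dog", "sat"], ["bird", "flew"]]
N = len(docs)

def tf(doc, term):
    return doc.count(term)                    # raw count of the term in the document

def idf(term):
    df = sum(term in doc for doc in docs)     # number of documents containing the term
    return math.log10(N / (df + 1))

tf(docs[0], "cat") * idf("cat")   # 2 * log10(3/2) ≈ 0.35  (rare, distinctive word)
tf(docs[0], "sat") * idf("sat")   # 1 * log10(3/3) =  0.0  (spread across documents)
```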
Measuring Vector Similarity
Once text is converted to vectors, their similarity can be calculated:
- Cosine Similarity: Measures the cosine of the angle between two vectors. It focuses on orientation, not magnitude. Formula: cos(θ) = (u · v) / (||u|| * ||v||)
- Euclidean Distance: The straight-line distance between two points (vectors) in Euclidean space. Formula: dist(u, v) = ||v - u|| = sqrt((v1 - u1)² + ... + (vn - un)²)
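Both measures in a few lines of NumPy; the two vectors point in the same direction but differ in length, so cosine similarity is 1 while the Euclidean distance is not 0:

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 6.0])

cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))   # 1.0: identical orientation
dist = np.linalg.norm(v - u)                                    # ~3.74: different magnitude
```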
Language Models (LM)
A language model assigns a probability to a sequence of words. A good LM can generate fluent, natural-sounding text.
Statistical Language Models
These models use conditional probability to predict the next word given the previous words. The probability of a word sequence W is calculated with the chain rule: P(W) = P(w1, w2, ..., wn) = Π P(wi | w1, ..., wi-1). A major challenge is the sparsity problem, where many valid sequences may not appear in the training data, leading to zero probabilities.
N-gram Language Models
To simplify the problem, n-gram models approximate the probability of a word by looking only at the previous n-1 words. Common types include:
- Bigram (n=2): Predicts a word based on the single preceding word: P(wi | wi-1)
- Trigram (n=3): Predicts a word based on the two preceding words: P(wi | wi-2, wi-1)
N-gram models still face data sparsity issues, which can be mitigated with techniques like smoothing and backoff.
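A toy bigram model with unsmoothed maximum-likelihood estimates, illustrating both the conditional probabilities and the sparsity problem:

```python
from collections import Counter

corpus = [["<s>", "i", "like", "cats", "</s>"],
          ["<s>", "i", "like", "dogs", "</s>"],
          ["<s>", "cats", "like", "fish", "</s>"]]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    unigrams.update(sent)
    bigrams.update(zip(sent, sent[1:]))

def p(word, prev):
    return bigrams[(prev, word)] / unigrams[prev]   # P(word | prev), estimated from counts

p("like", "i")      # 2/2 = 1.0
p("cats", "like")   # 1/3 ≈ 0.33
p("fish", "i")      # 0.0 -- unseen bigram: the sparsity problem that smoothing addresses
```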
Evaluating Language Models: Perplexity (PPL)
Perplexity is a measure of how well a probability model predicts a sample; a lower PPL indicates a better model. It is the inverse probability of the word sequence, normalized by the number of words N (equivalently, the reciprocal of the geometric mean of the word probabilities). Formula: PPL(W) = P(w1, w2, ..., wN)^(-1/N)
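Computing the formula for a made-up sequence of word probabilities:

```python
import math

word_probs = [0.2, 0.1, 0.25, 0.5]     # P(wi | history) for each of the N words
N = len(word_probs)

p_sequence = math.prod(word_probs)     # P(w1, ..., wN) via the chain rule
ppl = p_sequence ** (-1 / N)           # ≈ 4.47; a lower value means a better model
# Equivalently, ppl is the reciprocal of the geometric mean of word_probs.
```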
Topic Modeling
Topic modeling is an unsupervised technique for discovering abstract “topics” that occur in a collection of documents. Common methods include:
- Latent Semantic Analysis (LSA): Uses Singular Value Decomposition (SVD) to reduce the dimensionality of the document-term matrix.
- Latent Dirichlet Allocation (LDA): A probabilistic generative model that assumes documents are a mixture of topics, and topics are a mixture of words.
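A sketch of both methods with scikit-learn on a tiny two-topic corpus (the documents are invented):

```python
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

docs = ["cats and dogs are pets", "dogs chase cats",
        "stocks and bonds are investments", "investors buy stocks"]

dtm = CountVectorizer(stop_words="english").fit_transform(docs)

lsa = TruncatedSVD(n_components=2).fit(dtm)                        # SVD on the document-term matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)

print(lda.transform(dtm))   # per-document topic mixtures (rows sum to 1)
```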
Text Summarization
Extractive vs. Abstractive Summarization
- Extractive Summarization: Selects important words, phrases, or sentences directly from the source text to form the summary.
- Abstractive Summarization: Generates new phrases and sentences, paraphrasing the source text like a human would.
Graph-Based Methods: TextRank
Inspired by PageRank, TextRank represents a document as a graph where nodes can be sentences or words. It extracts key sentences or keywords by identifying the most important nodes in the graph based on their connections.
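A rough TextRank-style sketch using networkx: sentences are nodes, edges are weighted by word overlap, and PageRank scores pick the most central sentence. The overlap measure here is a simplification; the original TextRank normalizes it by sentence length.

```python
import itertools

import networkx as nx

sentences = ["the cat sat on the mat",
             "the dog chased the cat",
             "stock prices rose sharply"]

def overlap(a, b):
    return len(set(a.split()) & set(b.split()))   # shared words between two sentences

g = nx.Graph()
g.add_nodes_from(range(len(sentences)))
for i, j in itertools.combinations(range(len(sentences)), 2):
    w = overlap(sentences[i], sentences[j])
    if w:
        g.add_edge(i, j, weight=w)

scores = nx.pagerank(g, weight="weight")
best = max(scores, key=scores.get)   # index of the most important (most connected) sentence
```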
Sentiment Analysis
Sentiment analysis identifies and categorizes opinions expressed in a piece of text. It can be performed at different levels: document, sentence, or aspect.
Lexicon-Based Approach
This method uses a sentiment dictionary (lexicon) containing words and their associated sentiment scores (positive, negative, neutral). The overall sentiment of a text is determined by aggregating the scores of the words it contains.
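A minimal lexicon-based sketch; the tiny lexicon and its scores are made up:

```python
lexicon = {"good": 1, "great": 2, "bad": -1, "terrible": -2}

def sentiment(text):
    score = sum(lexicon.get(w, 0) for w in text.lower().split())   # aggregate word scores
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

sentiment("the movie was great")        # 'positive'
sentiment("the service was terrible")   # 'negative'
```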
Machine Learning-Based Approach
This approach trains a model on a dataset where texts are labeled with their sentiment (e.g., positive or negative). The model then learns to predict the sentiment of new, unlabeled texts.
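A sketch of the supervised route with a scikit-learn pipeline and toy labeled data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["loved this film", "what a great movie", "boring and slow", "terrible acting"]
labels = ["pos", "pos", "neg", "neg"]

# Vectorize the texts, then learn a classifier on the labeled examples.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

model.predict(["really great film"])   # sentiment predicted from the learned weights
```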
Introduction to Neural Networks
The Perceptron
A perceptron is the simplest form of a neural network, a single neuron that takes multiple binary inputs (x1, x2, …), applies weights (w1, w2, …), and produces a single binary output. The output is determined by whether the weighted sum of inputs plus a bias exceeds a certain threshold.
Logic Gates with a Perceptron
- A single perceptron can represent AND, NAND, and OR functions.
- It cannot represent the XOR function because XOR is not linearly separable. However, XOR can be represented by combining multiple perceptrons (e.g., in a multi-layer network).
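The standard textbook construction, with hand-picked weights and biases:

```python
import numpy as np

def perceptron(x, w, b):
    return int(np.dot(x, w) + b > 0)   # fire if the weighted sum plus bias exceeds 0

def AND(x1, x2):  return perceptron([x1, x2], [0.5, 0.5], -0.7)
def OR(x1, x2):   return perceptron([x1, x2], [0.5, 0.5], -0.2)
def NAND(x1, x2): return perceptron([x1, x2], [-0.5, -0.5], 0.7)

def XOR(x1, x2):                         # not linearly separable: needs two layers
    return AND(NAND(x1, x2), OR(x1, x2))

[XOR(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]   # [0, 1, 1, 0]
```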
Feed-Forward Neural Networks (FFNN)
An FFNN is a network where connections between nodes do not form a cycle. Information moves in only one direction: from the input nodes, through the hidden layers, to the output nodes.
Multi-Layer Perceptron (MLP)
An MLP is a class of FFNN with one or more hidden layers between the input and output layers. These hidden layers, combined with non-linear activation functions, allow the network to learn complex, non-linear patterns.
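As a closing sketch, a one-hidden-layer MLP from scikit-learn learning XOR, which a single perceptron cannot represent (whether it converges depends on the random initialization of such a tiny network):

```python
from sklearn.neural_network import MLPClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

# One hidden layer with a non-linear activation makes the problem learnable.
mlp = MLPClassifier(hidden_layer_sizes=(4,), activation="tanh",
                    solver="lbfgs", max_iter=5000, random_state=0)
mlp.fit(X, y)

mlp.predict(X)   # typically recovers [0, 1, 1, 0] on this toy dataset
```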