Introduction
This repository contains a small, self-contained application for retrieving web pages, parsing their creation and modification dates, and printing various accompanying statistics.
The user interface consists of a simple HTML form for specifying the URLs and displaying statistics. It is available at http://localhost:3000/ once the application is running.
The general idea for the project was to try to create a quick, text-based parser (recognizing only minimal parts of HTML) and to observe the performance and accuracy it achieves when parsing HTML content.
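For a flavor of what such a text-based scan looks like, here is a hypothetical Crystal sketch (the method name and the particular regular expression are illustrative only; the app's actual strategies are described in PARSING.md):

  # Hypothetical illustration of the text-based approach: scan the raw HTML
  # with a regular expression instead of building a DOM.
  def find_date_candidate(html : String) : String?
    if md = html.match(/<meta[^>]+(?:article:published_time|datePublished)[^>]+content="([^"]+)"/i)
      md[1]
    end
  end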
Running the app
To run the application, install Crystal, run shards to install dependencies, and then:
crystal [run --release] src/app.cr [-- options]
OR
shards build [--release]
./bin/app [options]
The application supports the following options:
-c, --connections (50) - Concurrent connections per host
-d, --downloaders (200) - Number of downloading threads
-h, --help - This help
-p, --parsers (10) - Number of date parsing threads
-v, --verbose (false) - Print a dot (.) on each download and a plus (+) on each parse
-C, --connect-timeout (5) - Set connect timeout (s)
-D, --dns-timeout (5) - Set DNS timeout (s)
-R, --read-timeout (5) - Set read timeout (s)
-W, --write-timeout (5) - Set write timeout (s)
-T, --timeout (5) - Set all timeouts (s)
Complete example:
git clone https://github.com/docelic/article_date
cd article_date
shards build --release
bin/app -c 50 -d 200
Usage
Once the app is started with the desired options, visit http://localhost:3000/ in the browser.
The minimal HTML user interface, served by Kemal, consists of a textarea for entering page URLs, one per line.
Clicking "Submit" processes the data and displays the results. Near-real-time updates are achieved by writing each result to the response IO as soon as it is ready, allowing the browser to render result rows in incremental chunks instead of waiting for all data to be processed.
Runtime and results
When the application starts, it prints a summary of the running configuration to STDOUT. Also, if option -v is provided, it prints a '.' for each downloaded file and a '+' for each parsed file.
As URLs are processed, each result row in the browser will display the following values:
- Sequential number following the ordering of URLs in the list, starting from 0
- Page URL (in case of a download error, accompanied by a copy of the error text)
- Parsed creation/modification date. If no date is determined, the value is empty
- Elapsed time for parsing the date (this value includes all Fiber wait times, but since the methods invoked are generally non-blocking and run without yielding the CPU, it is considered close to the real algorithm execution time)
- Elapsed time for downloading the page (this value includes all Fiber wait times, e.g. waiting for web servers to respond as well as for fibers to be scheduled on the CPU; as such it is regularly higher than the time spent in actual execution; see the timing sketch after this list)
- HTTP response status code
- Name of the program method which determined the date
- The corresponding confidence score (0.0 - 1.0)
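For readers curious how timings with these semantics can be captured, here is a minimal Crystal sketch (not the app's code; download and parse_date are hypothetical helpers):

  # Time.monotonic differences include any time the fiber spends waiting,
  # which is why the download figure is usually higher than pure work time.
  start = Time.monotonic
  body = download(url)                 # hypothetical download step
  download_time = Time.monotonic - start

  parse_time = Time.measure do         # Time.measure returns a Time::Span
    date = parse_date(body)            # hypothetical parsing step
  end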
The footer of the table also contains 3 summarized values:
- Total real (wallclock) time
- Sum of all parsing times
- Sum of all download times
Wallclock time is useful for determining general/overall performance.
Parsing times report the actual time spent parsing and are useful for identifying where the algorithms, or the handling of particular types of pages, may need improvement.
If download times are very high, the thread settings (options -d and -p) may be suboptimal and could be adjusted; if they are very low, the number of threads could be increased.
When processing of the URL list is complete, all open download connections and Fibers terminate. They are re-created on every request.
App design
The app is based on Fibers and Channels.
A group of D (--downloaders) Fibers works on the input URLs, processing each one while taking advantage of a basic HTTP Keep-Alive implementation and keeping at most C (--connections) HTTP::Clients open per individual domain.
Parallel connections to the same host are not created up-front, but are instantiated only if needed to crawl multiple pages from the same domain simultaneously.
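One way such a per-host cap can be realized (a sketch under assumptions, not necessarily how the app implements it) is a small per-domain pool of HTTP::Client objects backed by a Channel acting as a blocking queue:

  require "http/client"

  # Hypothetical per-host pool: at most `max` clients are ever created for a
  # host; a fiber blocks on the Channel when all of them are checked out.
  class HostPool
    def initialize(@host : String, @max : Int32)
      @clients = Channel(HTTP::Client).new(@max)
      @created = 0
      @mutex = Mutex.new
    end

    def checkout : HTTP::Client
      # Reuse an idle (kept-alive) client if one is available right now.
      select
      when client = @clients.receive
        return client
      else
        # none idle; fall through and maybe create a new one
      end
      @mutex.synchronize do
        if @created < @max
          @created += 1
          return HTTP::Client.new(@host, tls: true)
        end
      end
      @clients.receive # wait for a client to be checked back in
    end

    def checkin(client : HTTP::Client)
      @clients.send(client)
    end
  end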
The app is intended to run D (--downloaders) download fibers in parallel. However, if the input list is heavily sorted by domain, effective parallelism may drop to C (--connections). In such cases, either set options -d and -c to the same value or randomize the input list (e.g. sort -R <file>).
As each downloader downloads its page, it sends the intermediate data over the appropriate Channel to the parser processes, and then waits for the next page to download.
The parser processes receive downloaded data and try to determine the page creation or modification date using various parsing strategies. The current design of the parsing and extraction system is documented in the file PARSING.md.
As each parser finishes scanning through the page, it sends the final results and statistics to the results Channel and then waits for another page to parse.
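Stripped to its skeleton, the pipeline has roughly this shape (the record types, worker counts, and the download/find_date helpers are illustrative, not the app's actual names):

  # Skeleton of the downloader -> parser -> results pipeline over Channels.
  record Page, url : String, body : String
  record Result, url : String, date : String?

  urls    = Channel(String).new
  pages   = Channel(Page).new
  results = Channel(Result).new

  200.times do                      # --downloaders
    spawn do
      while url = urls.receive?     # receive? returns nil once the channel is closed
        pages.send Page.new(url, download(url))
      end
    end
  end

  10.times do                       # --parsers
    spawn do
      while page = pages.receive?
        results.send Result.new(page.url, find_date(page.body))
      end
    end
  end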
In more general terms
The implemented design based on Channels, "downloaders" and "parsers" was chosen with the idea that a real-world, larger system could use a similar architecture at a larger scale.
For example, the downloader processes might be advanced clients capable of downloading JavaScript-heavy/SPA pages, re-parsing stored content instead of downloading it again, and/or using various APIs instead of getting data through crawling (e.g. search engines get data from Wikipedia via an API rather than by downloading HTML).
These processes would then pass the content on via message passing or queueing systems to further processes down the line, of which date parsers could be just one type of consumer.
Improvements
In a more complete, non-prototype implementation, a couple of improvements could be added:
- More per-domain crawling limits and/or bandwidth caps
- Keeping track of which parsing strategies had the best success rate on particular domains and/or subdirectories within domains. The order in which the parsing strategies are run could then be dynamically adjusted for best performance (a rough sketch of this follows below).
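Purely as a hypothetical illustration of the second idea (nothing like this exists in the app yet), such a tracker could look roughly like this:

  # Hypothetical tracker: count which parsing strategy succeeded on which
  # domain, and try the historically best ones first on subsequent pages.
  class StrategyRanking
    def initialize
      @hits = Hash(String, Hash(String, Int32)).new do |hash, domain|
        hash[domain] = Hash(String, Int32).new(0)
      end
    end

    def record_success(domain : String, strategy : String)
      @hits[domain][strategy] += 1
    end

    def ordered(strategies : Array(String), domain : String) : Array(String)
      strategies.sort_by { |s| -@hits[domain][s] }
    end
  end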