wikiscraper
Parses a given Wikipedia article and returns its text broken up into sections.
This is intended to be deployed as a web service. A live example can be found at https://ajh-wikiscraper.herokuapp.com/?url=https://en.wikipedia.org/wiki/Pet_door. The wikiscraper accepts HTTP GET and POST requests.
Example inputs
url=https://en.wikipedia.org/wiki/Pet_door
Set "type" to "html" to return the unparsed HTML text.
url=https://en.wikipedia.org/wiki/Pet_door
type=html
Set "type" to "wikitext" to return the unparsed wikitext.
url=https://en.wikipedia.org/wiki/Pet_door
type=wikitext
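Since the service takes its parameters as an ordinary query string, a request URL can be assembled with any HTTP client. A minimal Python sketch (the base URL is the live example above; substitute your own deployment, and note that `build_request_url` is an illustrative helper, not part of wikiscraper):

```python
from urllib.parse import urlencode

# Base URL of a deployed wikiscraper instance (the live example above;
# substitute your own deployment).
BASE = "https://ajh-wikiscraper.herokuapp.com/"

def build_request_url(article_url, response_type=None):
    """Build a wikiscraper GET request URL for the given article.

    response_type may be None (sectioned text), "html", or "wikitext".
    """
    params = {"url": article_url}
    if response_type:
        params["type"] = response_type
    # urlencode percent-encodes the article URL so it survives as a
    # single query parameter.
    return BASE + "?" + urlencode(params)

print(build_request_url("https://en.wikipedia.org/wiki/Pet_door", "wikitext"))
```

The same parameters can be sent in a POST body instead, since the service accepts both methods.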
Example outputs
{
  "data": {
    "contents": [
      {"number": 0, "title": "Pet door", "content": "A pet door or pet..."},
      {"number": 1, "title": "Purpose", "content": "A pet door is found..."},
      ...
    ]
  }
}
With type set to "html"
{
  "data": {"contents": ["<div class=\"mw-parser-output\">..."]}
}
With type set to "wikitext"
{
  "data": {"contents": ["[[File:Doggy door exit.JPG|thumb|A dog..."]}
}
When the URL is missing or invalid
{
  "error": "A valid Wikipedia URL must be passed."
}
When the article doesn't exist
{
  "error": "The page you specified doesn't exist."
}
Development
shards install
crystal run src/wikiscraper_web.cr
Contributing
- Fork it (https://github.com/your-github-user/wikiscraper/fork)
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create a new Pull Request