Web Scraper Data Tool

Tuesday, March 13, 2018

By: Chris Dunn

A number of years ago I was migrating and merging a series of websites into a Content Management System.  We were attempting to organize a number of mixed-technology sites, some hard-coded HTML, others ASP or PHP, under a single CMS.  Rather than manually translate each page into the new system, I decided to parse the sites that followed a template pattern and import sections into the properties of the new system's pages.  I developed a web scraping tool that extracted the data based on a data map, grabbing content within elements using CSS selectors.  It wasn't perfect, but it did save our team a remarkable amount of manual work.

Since then, I've come across other scenarios where such a tool would be invaluable, specifically in gathering real-world data for mock-ups and testing.  While "lorem ipsum" and other dummy data serve well as placeholders, they don't provide the same feel and value as "production" data.

If I am building online store software, I may not have the inventory data available to flesh out a prototype demo of my product.  It's hard to get excited about Lorem Ipsum.  But actual data from a live website would better flesh out my prototype until the real data is available.

Using what I learned from my early migration, and with these new use cases in mind, I developed a configurable web scraping tool that uses an XML data map to navigate to specific site pages, download each page, extract specific element text (using CSS selectors), download images, and follow links to related data (continuing the crawl).  I've developed it as a command-line tool that can be run as a scheduled task.
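To make the "extract element text using CSS selectors" step concrete, here is a minimal Python sketch of the idea.  It is not the tool's actual implementation (the tool itself is not written in Python), and it only understands two selector forms, `#id` and `.class`.  The class name `SimpleSelectorExtractor`, the helper `extract_text`, and the sample HTML are all hypothetical, made up for illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class SimpleSelectorExtractor(HTMLParser):
    """Collect the text of every element matching '#id' or '.class'."""

    def __init__(self, selector):
        super().__init__()
        self.kind, self.value = selector[0], selector[1:]
        self.depth = 0        # > 0 while we are inside a matching element
        self.results = []
        self._buffer = []

    def _matches(self, attrs):
        attrs = dict(attrs)
        if self.kind == "#":
            return attrs.get("id") == self.value
        if self.kind == ".":
            return self.value in (attrs.get("class") or "").split()
        return False

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1          # nested element inside a match
        elif self._matches(attrs):
            self.depth = 1           # entering a matching element
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return
        self.depth -= 1
        if self.depth == 0:
            # Leaving the matching element: store its whitespace-normalized text.
            self.results.append(" ".join("".join(self._buffer).split()))

    def handle_data(self, data):
        if self.depth > 0:
            self._buffer.append(data)


def extract_text(html, selector):
    parser = SimpleSelectorExtractor(selector)
    parser.feed(html)
    parser.close()
    return parser.results


# Hypothetical page fragment, loosely shaped like a best-seller list.
SAMPLE_HTML = (
    '<h1 id="zg_listTitle">Best Sellers in <span>Books</span></h1>'
    '<div class="zg_title first">Book One</div>'
    '<div class="zg_title">Book Two</div>'
)

list_title = extract_text(SAMPLE_HTML, "#zg_listTitle")
book_titles = extract_text(SAMPLE_HTML, ".zg_title")
```

A real scraper would hand this work to a full CSS selector engine; the sketch only shows the shape of the lookup that each entry in a data map drives.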

webscrape -m amazon.xml -o amazon-out.xml -d 20000

<?xml version="1.0" encoding="utf-8" ?>
<DataMap Name="Amazon" >
  <Urls>
    <Url><![CDATA[https://web.archive.org/web/20150616214557/http://www.amazon.com/gp/bestsellers/books/18]]></Url>
  </Urls>
  <DataMapItems>
    <DataMapItem Type="text" Path="#zg_listTitle" Name="Title"/>
    <DataMapItem Type="list" Path=".zg_itemImmersion" ListName="Books" Name="Book">
      <DataMapItems>
        <DataMapItem Type="text" Path=".zg_rankDiv" Name="Rank"/>
        <DataMapItem Type="text" Path=".zg_title" Name="Title"/>
        <DataMapItem Type="text" Path=".zg_byline" Name="Byline"/>
        <DataMapItem Type="text" Path=".price" Name="Price"/>
        <DataMapItem Type="image" Path="img" Name="Thumb"/>
      </DataMapItems>
    </DataMapItem>
  </DataMapItems>
</DataMap>
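
To illustrate how a data map like the one above nests, here is a small Python sketch that reads the XML into plain dictionaries.  It is only an assumption about how such a map could be interpreted, not the tool's own loader; the function names `parse_items` and `load_data_map` are hypothetical, and the sample is a shortened copy of the map with the XML declaration left out.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the data map, kept inline so the sketch is self-contained.
SAMPLE_MAP = """<DataMap Name="Amazon">
  <Urls>
    <Url><![CDATA[https://web.archive.org/web/20150616214557/http://www.amazon.com/gp/bestsellers/books/18]]></Url>
  </Urls>
  <DataMapItems>
    <DataMapItem Type="text" Path="#zg_listTitle" Name="Title"/>
    <DataMapItem Type="list" Path=".zg_itemImmersion" ListName="Books" Name="Book">
      <DataMapItems>
        <DataMapItem Type="text" Path=".zg_rankDiv" Name="Rank"/>
        <DataMapItem Type="image" Path="img" Name="Thumb"/>
      </DataMapItems>
    </DataMapItem>
  </DataMapItems>
</DataMap>"""


def parse_items(parent):
    """Recursively read the DataMapItem children of a DataMap or list item."""
    container = parent.find("DataMapItems")
    if container is None:
        return []
    items = []
    for node in container.findall("DataMapItem"):
        item = {
            "type": node.get("Type"),
            "path": node.get("Path"),   # the CSS selector
            "name": node.get("Name"),
        }
        if node.get("ListName"):
            item["list_name"] = node.get("ListName")
        children = parse_items(node)    # only "list" items have children here
        if children:
            item["items"] = children
        items.append(item)
    return items


def load_data_map(xml_text):
    root = ET.fromstring(xml_text)
    return {
        "name": root.get("Name"),
        "urls": [url.text.strip() for url in root.iter("Url")],
        "items": parse_items(root),
    }


data_map = load_data_map(SAMPLE_MAP)
```

Read this way, a "list" item selects a repeating container (`.zg_itemImmersion`), and its child items are selectors that would be evaluated inside each matched container.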

Rather than duplicate effort, I will post and maintain all relevant user docs on the GitHub repository.  This is a somewhat actively maintained project.  Usually, when I find a scenario where I am unable to extract certain data, I make an update.  Please feel free to give me your feedback.

DISCLAIMER:

Please be careful about the manner in which you scrape data, the sites you target, and the frequency with which you crawl a site.  I do not directly scrape websites that I do not own, so I am not negatively impacting a live site.  I suggest using the Wayback Machine as an alternative to the actual site.  Also, be aware that some of the data you scrape is copyrighted and cannot be used commercially.  So basically, scrape at your own risk.
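
As a rough sketch of both suggestions, the Python snippet below builds a Wayback Machine snapshot URL (the same `https://web.archive.org/web/<timestamp>/<url>` shape used in the data map above) and spaces out requests with a fixed pause.  The helper names `wayback_url` and `throttled` and the pause length are hypothetical choices for illustration; they are not options of the tool.

```python
import time


def wayback_url(original_url, timestamp):
    """Build a snapshot URL; timestamp is a YYYYMMDDhhmmss string."""
    return f"https://web.archive.org/web/{timestamp}/{original_url}"


def throttled(urls, delay_seconds, sleep=time.sleep):
    """Yield URLs one at a time, pausing between consecutive URLs."""
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)    # no pause before the very first URL
        yield url


snapshot = wayback_url(
    "http://www.amazon.com/gp/bestsellers/books/18", "20150616214557"
)
```

Passing `sleep` as a parameter keeps the pause easy to swap out, for example when trying the crawl order without actually waiting.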

https://github.com/scdunn/WebScraper

 

Tags: c# xml data

