Android Web Crawler Example: Multithreaded Implementation

A web crawler is a program that automates the task of indexing website pages. The technique is also known as web spidering, and search engines use it extensively to provide relevant, up-to-date results for user search queries.

Typical uses of a web crawler include, but are not limited to:

  1. Page Indexing for Search Engines
  2. Page Content Scraping
  3. Dictionary Words Processing
  4. Syntax and Structure Validation

Given a root URL, a web crawler crawls the content of the current page and adds the URLs extracted from it to a processing queue of uncrawled URLs. Once a page is crawled, its data is stored in a database for later processing as required. The task is time consuming if hyperlinks are crawled sequentially, so we will create an example Android web crawler application that performs the crawling tasks in parallel. An SQLite database will be used to save a record of each crawled URL.
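The crawl loop described above can be sketched in plain Java as follows. This is only an illustration under stated assumptions: the class name CrawlLoopSketch, the linkExtractor function (a stand-in for "download the page and extract its hyperlinks") and the maxPages limit are hypothetical and not part of the application built in this tutorial. The sketch is sequential; the parallel version is what the rest of the tutorial is about.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;

class CrawlLoopSketch {

    /**
     * Crawls breadth-first starting at rootUrl and returns the crawled URLs
     * in visit order. linkExtractor is a hypothetical stand-in for fetching
     * a page and returning the hyperlinks found in it.
     */
    static Set<String> crawl(String rootUrl,
                             Function<String, List<String>> linkExtractor,
                             int maxPages) {
        Queue<String> uncrawled = new ArrayDeque<>();   // URLs waiting to be crawled
        Set<String> crawled = new LinkedHashSet<>();    // URLs already crawled
        uncrawled.add(rootUrl);

        while (!uncrawled.isEmpty() && crawled.size() < maxPages) {
            String url = uncrawled.poll();
            if (!crawled.add(url)) {
                continue;                               // already crawled, skip
            }
            for (String link : linkExtractor.apply(url)) {
                if (!crawled.contains(link)) {
                    uncrawled.add(link);                // schedule newly found URL
                }
            }
        }
        return crawled;
    }
}
```

A call such as CrawlLoopSketch.crawl(root, extractor, 100) would visit at most 100 pages, where extractor is whatever function the caller supplies to fetch and parse a page.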

Android Web Crawler Example Application

1. Create an Android Application

2. Preparing Application Manifest File

Add the INTERNET permission (android.permission.INTERNET), because crawling requires fetching URLs with HttpURLConnection. The final AndroidManifest.xml will be as below.

3. Preparing Layout Files

Our layout will comprise a Button to start and stop crawling and an EditText to take the URL to be crawled as user input. A ProgressBar will be shown while crawling is running, and a TextView will update the user with the count of crawled pages. The final layout /res/layout/activity_main.xml will be as below.

4. Web Crawler Implementation

Create a class WebCrawler which will define the routines and objects needed to crawl a page. Define an interface CrawlingCallback to provide callbacks for crawling events such as page crawling completed, page crawling failed and whole crawl finished.
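As a rough sketch, the callback interface could look like the following. Only the interface name CrawlingCallback and the three kinds of events come from the text above; the method names, their parameters and the CountingCallback listener are assumptions made for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Events the crawler reports to its caller. Method names are assumptions. */
interface CrawlingCallback {
    void onPageCrawlingCompleted(String url);
    void onPageCrawlingFailed(String url, String reason);
    void onCrawlingCompleted();
}

/**
 * Hypothetical listener that only counts events. Counters are atomic
 * because callbacks may arrive from several worker threads.
 */
class CountingCallback implements CrawlingCallback {
    final AtomicInteger crawled = new AtomicInteger();
    final AtomicInteger failed = new AtomicInteger();
    volatile boolean finished;

    @Override
    public void onPageCrawlingCompleted(String url) {
        crawled.incrementAndGet();
    }

    @Override
    public void onPageCrawlingFailed(String url, String reason) {
        failed.incrementAndGet();
    }

    @Override
    public void onCrawlingCompleted() {
        finished = true;
    }
}
```

In an Android activity, an implementation of this kind would typically post the updated count to the UI thread instead of just incrementing a counter.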

Define a private class CrawlerRunnable which implements Runnable. The constructor of CrawlerRunnable takes a CrawlingCallback and the URL to be crawled as parameters. Inside this class, we will have a method to download the HTML content body for a given URL. Once we have the raw HTML content, we need to process it to extract the hyperlinks from it. To extract the URLs, I have used jsoup, which is a very good HTML parser library. Download jsoup version 1.8.1 and add it to the libs folder of your Android project. The final code for CrawlerRunnable will be as below.
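The two steps a crawling task performs, downloading the HTML body and extracting hyperlinks, can be sketched with the JDK alone. Note the swap: the tutorial uses jsoup for extraction, whereas this sketch uses a simplified regular expression so that it has no external dependency. With jsoup available, the usual approach is to parse with Jsoup.parse(html, baseUrl), select "a[href]" elements and read the "abs:href" attribute. The class name PageFetchSketch, the method names and the 10-second timeouts are assumptions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PageFetchSketch {
    // Simplified href matcher; a regex stand-in for a real HTML parser such as jsoup.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    /** Downloads the body of pageUrl as text (assumes UTF-8 content). */
    static String download(String pageUrl) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(pageUrl).toURL().openConnection();
        conn.setConnectTimeout(10_000);   // assumed timeouts, in milliseconds
        conn.setReadTimeout(10_000);
        try (InputStream in = conn.getInputStream()) {
            return readAll(in);
        } finally {
            conn.disconnect();
        }
    }

    /** Reads a stream fully into a String, line by line. */
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    /** Returns absolute http/https links found in html, resolved against baseUrl. */
    static List<String> extractLinks(String baseUrl, String html) {
        List<String> links = new ArrayList<>();
        URI base = URI.create(baseUrl);
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String href = m.group(1).trim();
            int hash = href.indexOf('#');
            if (hash >= 0) {
                href = href.substring(0, hash);       // drop the fragment part
            }
            if (href.isEmpty()) {
                continue;                             // pure in-page anchor
            }
            try {
                URI resolved = base.resolve(href);
                String scheme = resolved.getScheme();
                if ("http".equals(scheme) || "https".equals(scheme)) {
                    links.add(resolved.toString());
                }
            } catch (IllegalArgumentException ignored) {
                // malformed href, skip it
            }
        }
        return links;
    }
}
```

A CrawlerRunnable would call something like download() in its run() method, pass the result to the link extractor, and report success or failure through the callback.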

To process crawling tasks in parallel, we will use ThreadPoolExecutor, which manages a work queue of runnable tasks and executes them in its pool of threads. You can define the initial (core) and maximum size of the thread pool. To manage the CrawlerRunnable tasks, create a new private class RunnableManager which defines the required methods to add tasks to the pool and to cancel them.
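A minimal sketch of such a manager is shown below. The pool size of 4, the class name RunnableManagerSketch and the method names are assumptions chosen for illustration; only the use of ThreadPoolExecutor comes from the text above.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class RunnableManagerSketch {
    // Assumed pool size. With an unbounded LinkedBlockingQueue the executor
    // never grows beyond the core size, so core and maximum are set equal.
    private static final int POOL_SIZE = 4;

    private final ThreadPoolExecutor pool = new ThreadPoolExecutor(
            POOL_SIZE, POOL_SIZE, 1, TimeUnit.SECONDS,
            new LinkedBlockingQueue<Runnable>());

    /** Queues a crawling task; it runs as soon as a pool thread is free. */
    void addToCrawlingQueue(Runnable task) {
        pool.execute(task);
    }

    /**
     * Interrupts running tasks and drops queued ones. After this call the
     * pool rejects new tasks, so a fresh manager is needed to crawl again.
     */
    void cancelAllRunnable() {
        pool.shutdownNow();
    }

    /** Approximate number of tasks that are running or still queued. */
    int unfinishedTasks() {
        return pool.getActiveCount() + pool.getQueue().size();
    }

    /** Stops accepting tasks and waits for queued ones; true if all finished in time. */
    boolean finishAndWait(long timeout, TimeUnit unit) {
        pool.shutdown();
        try {
            return pool.awaitTermination(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();   // preserve the interrupt status
            return false;
        }
    }
}
```

The crawler can treat unfinishedTasks() reaching zero with an empty URL queue as a sign that the crawl is complete, although the value is only approximate while tasks are still being scheduled.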

To process URLs, we will maintain a queue of URLs to be crawled and a HashSet of crawled URLs to avoid re-crawling a page. We also add a few methods for queueing runnable tasks and for deleting and inserting database content. The final complete code for the class will be as below.
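The queue and HashSet bookkeeping might be sketched like this. The class name UrlFrontierSketch and its methods are hypothetical. One deliberate difference from the description above: the sketch records a URL as seen when it is queued rather than when it is crawled, which also prevents the same URL from being queued twice by concurrent tasks.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

class UrlFrontierSketch {
    private final Queue<String> toCrawl = new ArrayDeque<>();  // pending URLs
    private final Set<String> seen = new HashSet<>();          // queued or crawled URLs

    /**
     * Queues url unless it was seen before. Synchronized because several
     * crawling threads may report extracted links at the same time.
     */
    synchronized boolean offer(String url) {
        if (url == null || !seen.add(url)) {
            return false;          // null or duplicate, nothing queued
        }
        toCrawl.add(url);
        return true;
    }

    /** Returns the next URL to crawl, or null when the queue is empty. */
    synchronized String poll() {
        return toCrawl.poll();
    }

    /** Number of distinct URLs seen so far. */
    synchronized int seenCount() {
        return seen.size();
    }
}
```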

Once we have finished crawling a page, we add the crawling info for this page to our record. Create a class CrawlerDB which extends SQLiteOpenHelper to manage the database. The final file will be as below.

Finally, let’s move on to the activity. It will implement the OnClickListener interface for view click callbacks. After the start button is clicked, we start the crawling task with the help of a WebCrawler object, and CrawlingCallback is used to update the user with the crawled page count. Since crawling consumes both space and time, we will stop crawling after one minute if the user does not stop it manually. Once the crawling task is finished, you can check the output in LogCat, where the crawled URLs are displayed after querying the crawling database. The final code after adding these functionalities will be as below.
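The "stop after one minute unless the user stops first" behaviour can be sketched without Android classes as follows. This is an assumption-laden illustration: it uses a ScheduledExecutorService from the JDK, whereas an Android activity would more commonly schedule the stop with a Handler. The class name CrawlStopTimerSketch and the stopAction parameter are hypothetical.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

class CrawlStopTimerSketch {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final AtomicBoolean stopped = new AtomicBoolean(false);
    private final Runnable stopAction;   // e.g. cancel crawling tasks and update the UI

    CrawlStopTimerSketch(Runnable stopAction) {
        this.stopAction = stopAction;
    }

    /** Arms the timer; stop() is invoked automatically once the timeout elapses. */
    void start(long timeout, TimeUnit unit) {
        scheduler.schedule(this::stop, timeout, unit);
    }

    /**
     * Stops crawling. Safe to call from both the timer and a button click:
     * the compare-and-set guarantees stopAction runs at most once.
     */
    void stop() {
        if (stopped.compareAndSet(false, true)) {
            stopAction.run();
            scheduler.shutdownNow();     // cancel the pending timeout, if any
        }
    }

    boolean isStopped() {
        return stopped.get();
    }
}
```

For the one-minute limit described above, the timer would be armed with start(1, TimeUnit.MINUTES) when the start button is clicked, and stop() would be called from the stop button handler.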

5. Build and Run Application

Now, try building and running your application. Enter the URL of the website you want to crawl. When crawling finishes, you can check the visited URLs in your LogCat output. The information saved in the crawler database can be used as required.
