Android Web Crawler Example : Multithreaded Implementation

by proxyadmin · Published December 15, 2016 · Updated July 6, 2021

A web crawler automates the task of indexing website pages. The technique is also referred to as web spidering, and search engines use it extensively to keep the results for user search queries relevant and up to date.

Typical uses of a web crawler include, but are not limited to:

Page Indexing for Search Engines
Page Content Scraping
Dictionary Words Processing
Syntax and Structure Validation

Given a root URL, a web crawler fetches the content of the current page and adds the URLs extracted from it to a queue of uncrawled URLs. Once a page is crawled, its data is stored in a database for later processing. Crawling hyperlinks sequentially is time consuming, so we will create an Android web crawler example application that runs crawling tasks in parallel. An SQLite database will store a record of each crawled URL.
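
The queue-and-set bookkeeping described above can be sketched in plain Java, independent of Android. This is a minimal single-threaded illustration rather than the application's code: the FrontierSketch class and the in-memory linkGraph map that stands in for real pages are assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical helper: crawls an in-memory "web" instead of real pages.
class FrontierSketch {

    // Returns the pages in the order they were crawled, starting from rootUrl.
    static List<String> crawl(String rootUrl, Map<String, List<String>> linkGraph) {
        Queue<String> uncrawled = new ArrayDeque<>(); // URLs waiting to be crawled
        Set<String> seen = new HashSet<>();           // URLs already queued or crawled
        List<String> crawledOrder = new ArrayList<>();

        uncrawled.add(rootUrl);
        seen.add(rootUrl);

        while (!uncrawled.isEmpty()) {
            String url = uncrawled.remove();
            crawledOrder.add(url); // stands in for downloading and storing the page
            for (String link : linkGraph.getOrDefault(url, List.of())) {
                // Set.add returns false for duplicates, so each URL is queued once
                if (seen.add(link)) {
                    uncrawled.add(link);
                }
            }
        }
        return crawledOrder;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("a", "c"),
                "c", List.of("a"));
        System.out.println(crawl("a", graph)); // prints [a, b, c]
    }
}
```

Tracking every URL that has been queued, not only the ones already crawled, keeps a page that is linked from many places from entering the queue more than once.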

Android Web Crawler Example Application

1. Create an android application

2. Preparing Application Manifest File

Add the Internet permission, since crawling fetches each URL with HttpURLConnection. The final AndroidManifest.xml is shown below.

<?xml version="1.0" encoding="utf-8"?>
<manifest xmlns:android="http://schemas.android.com/apk/res/android" package="com.android.webcrawler" android:versionCode="1" android:versionName="1.0" >
 
    <uses-sdk android:minSdkVersion="14" android:targetSdkVersion="21" />
 
    <uses-permission android:name="android.permission.INTERNET" />
 
    <application android:allowBackup="true" android:icon="@drawable/ic_launcher" android:label="@string/app_name" android:theme="@style/AppTheme" >
        <activity android:name=".MainActivity" android:label="@string/app_name" >
            <intent-filter>
                <action android:name="android.intent.action.MAIN" />
 
                <category android:name="android.intent.category.LAUNCHER" />
            </intent-filter>
        </activity>
    </application>
 
</manifest>

3. Preparing Layout Files

The layout consists of an EditText that takes the URL to be crawled and a Button that starts crawling. While crawling is running, a ProgressBar is shown, a TextView reports the number of crawled pages, and a second Button stops crawling. The final layout, /res/layout/activity_main.xml, is shown below.

<LinearLayout xmlns:android="http://schemas.android.com/apk/res/android"
              xmlns:tools="http://schemas.android.com/tools"
              android:layout_width="match_parent"
              android:layout_height="match_parent"
              android:orientation="vertical"
              android:padding="20dp"
              tools:context="com.android.webcrawler.MainActivity">
 
    <EditText
        android:id="@+id/webUrl"
        android:layout_width="fill_parent"
        android:layout_height="wrap_content"
        android:layout_marginTop="20dp"
        android:hint="Enter URL"/>
 
    <Button
        android:id="@+id/start"
        android:layout_width="fill_parent"
        android:layout_height="wrap_content"
        android:layout_marginTop="20dp"
        android:onClick="onClick"
        android:text="Start Crawling"/>
 
    <LinearLayout
        android:id="@+id/crawlingInfo"
        android:layout_width="fill_parent"
        android:layout_height="wrap_content"
        android:orientation="vertical"
        android:visibility="invisible">
 
        <ProgressBar
            android:id="@+id/progressBar"
            style="?android:attr/progressBarStyleLarge"
            android:layout_width="wrap_content"
            android:layout_height="wrap_content"
            android:layout_gravity="center_horizontal"
            android:layout_marginTop="20dp"/>
 
        <TextView
            android:id="@+id/progressText"
            style="?android:attr/textAppearanceLarge"
            android:layout_width="fill_parent"
            android:layout_height="wrap_content"
            android:gravity="center_horizontal"/>
 
        <Button
            android:id="@+id/stop"
            android:layout_width="fill_parent"
            android:layout_height="wrap_content"
            android:layout_marginTop="20dp"
            android:onClick="onClick"
            android:text="Stop Crawling"/>
    </LinearLayout>
 
</LinearLayout>

4. Web Crawler Implementation

Create a class WebCrawler that defines the routines and objects needed to crawl a page. Inside it, define an interface CrawlingCallback that reports crawling events: a page was crawled, a page failed to crawl, and the whole crawl finished.

/**
 * Interface for crawling callback
 */
interface CrawlingCallback {
	void onPageCrawlingCompleted();
 
	void onPageCrawlingFailed(String Url, int errorCode);
 
	void onCrawlingCompleted();
}
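
As an illustration of how a caller might implement this interface, here is a minimal sketch that only counts events. The CountingCallback class is hypothetical and not part of the application. It uses AtomicInteger because the crawler invokes the callbacks from its worker threads.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Same shape as the interface above, repeated so the sketch compiles on its own.
interface CrawlingCallback {
    void onPageCrawlingCompleted();

    void onPageCrawlingFailed(String url, int errorCode);

    void onCrawlingCompleted();
}

// Hypothetical implementation that only counts crawled and failed pages.
class CountingCallback implements CrawlingCallback {
    private final AtomicInteger crawled = new AtomicInteger();
    private final AtomicInteger failed = new AtomicInteger();

    @Override
    public void onPageCrawlingCompleted() {
        crawled.incrementAndGet();
    }

    @Override
    public void onPageCrawlingFailed(String url, int errorCode) {
        failed.incrementAndGet();
    }

    @Override
    public void onCrawlingCompleted() {
        // Nothing to count here; a UI would hide its progress indicator.
    }

    int crawledCount() {
        return crawled.get();
    }

    int failedCount() {
        return failed.get();
    }
}
```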

Define a private class CrawlerRunnable that implements Runnable. Its constructor takes a CrawlingCallback and the URL to be crawled. The class has a method that downloads the HTML body for a given URL. Once we have the raw HTML, we process it to extract hyperlinks. For URL extraction I have used jsoup, which is a very good HTML parser library. Download jsoup version 1.8.1 and add it to the libs folder of your Android project. The final code for CrawlerRunnable is shown below.

/**
	 * Runnable task which performs task of crawling and adding encountered URls
	 * to crawling list
	 *
	 * @author CLARION
	 *
	 */
	private class CrawlerRunnable implements Runnable {
 
		CrawlingCallback mCallback;
		String mUrl;
 
		public CrawlerRunnable(CrawlingCallback callback, String Url) {
			this.mCallback = callback;
			this.mUrl = Url;
		}
 
		@Override
		public void run() {
			String pageContent = retreiveHtmlContent(mUrl);
 
			if (!TextUtils.isEmpty(pageContent.toString())) {
				insertIntoCrawlerDB(mUrl, pageContent);
				synchronized (lock) {
					crawledURL.add(mUrl);
				}
				mCallback.onPageCrawlingCompleted();
			} else {
				mCallback.onPageCrawlingFailed(mUrl, -1);
			}
 
			if (!TextUtils.isEmpty(pageContent.toString())) {
				// START
				// JSoup Library used to filter urls from html body
				Document doc = Jsoup.parse(pageContent.toString());
				Elements links = doc.select("a[href]");
				for (Element link : links) {
					String extractedLink = link.attr("href");
					if (!TextUtils.isEmpty(extractedLink)) {
						synchronized (lock) {
							if (!crawledURL.contains(extractedLink))
								uncrawledURL.add(extractedLink);
						}
 
					}
				}
				// End JSoup
			}
			// Send msg to handler that crawling for this url is finished
			// start more crawling tasks if queue is not empty
			mHandler.sendEmptyMessage(0);
 
		}
 
		private String retreiveHtmlContent(String Url) {
			URL httpUrl = null;
			try {
				httpUrl = new URL(Url);
			} catch (MalformedURLException e) {
				e.printStackTrace();
			}
 
			int responseCode = HttpStatus.SC_OK;
			StringBuilder pageContent = new StringBuilder();
			try {
				if (httpUrl != null) {
					HttpURLConnection conn = (HttpURLConnection) httpUrl
							.openConnection();
					conn.setConnectTimeout(5000);
					conn.setReadTimeout(5000);
					responseCode = conn.getResponseCode();
					if (responseCode != HttpStatus.SC_OK) {
						throw new IllegalAccessException(
								" http connection failed");
					}
					BufferedReader br = new BufferedReader(
							new InputStreamReader(conn.getInputStream()));
					String line = null;
					while ((line = br.readLine()) != null) {
						pageContent.append(line);
					}
				}
 
			} catch (IOException e) {
				e.printStackTrace();
				mCallback.onPageCrawlingFailed(Url, -1);
			} catch (IllegalAccessException e) {
				e.printStackTrace();
				mCallback.onPageCrawlingFailed(Url, responseCode);
			}
 
			return pageContent.toString();
		}
 
	}
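
The reading loop in retreiveHtmlContent appends each line without a separator and never closes the reader. One possible variation is sketched below in plain Java. The readBody helper is hypothetical, and decoding the response as UTF-8 is an assumption, since the original relies on the platform default charset.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

class BodyReaderSketch {

    // Reads the whole stream as UTF-8 text, keeping one '\n' after each line.
    static String readBody(InputStream in) throws IOException {
        StringBuilder body = new StringBuilder();
        // try-with-resources closes the reader (and the stream) even on failure
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        return body.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] html = "<a href=\"/x\">x</a>\n<p>done</p>".getBytes(StandardCharsets.UTF_8);
        System.out.print(readBody(new ByteArrayInputStream(html)));
    }
}
```

In the crawler, the argument would be conn.getInputStream().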

To process crawling tasks in parallel, we will use ThreadPoolExecutor, which manages a work queue of runnable tasks and executes them on its pool of threads. You can define the core and maximum size of the thread pool. To manage CrawlerRunnable tasks, create a new private class RunnableManager that defines the methods needed to add tasks to the pool and to cancel them.

/**
	 * Helper class to interact with ThreadPoolExecutor for adding and removing
	 * runnable in workQueue
	 *
	 * @author CLARION
	 *
	 */
	private class RunnableManager {
 
		// Sets the amount of time an idle thread will wait for a task before
		// terminating
		private static final int KEEP_ALIVE_TIME = 1;
 
		// Sets the Time Unit to seconds
		private final TimeUnit KEEP_ALIVE_TIME_UNIT = TimeUnit.SECONDS;
 
		// Sets the initial threadpool size to 5
		private static final int CORE_POOL_SIZE = 5;
 
		// Sets the maximum threadpool size to 8
		private static final int MAXIMUM_POOL_SIZE = 8;
 
		// A queue of Runnables for crawling url
		private final BlockingQueue<Runnable> mCrawlingQueue;
 
		// A managed pool of background crawling threads
		private final ThreadPoolExecutor mCrawlingThreadPool;
 
		public RunnableManager() {
			mCrawlingQueue = new LinkedBlockingQueue<>();
			mCrawlingThreadPool = new ThreadPoolExecutor(CORE_POOL_SIZE,
					MAXIMUM_POOL_SIZE, KEEP_ALIVE_TIME, KEEP_ALIVE_TIME_UNIT,
					mCrawlingQueue);
		}
 
		private void addToCrawlingQueue(Runnable runnable) {
			mCrawlingThreadPool.execute(runnable);
		}
 
		private void cancelAllRunnable() {
			mCrawlingThreadPool.shutdownNow();
		}
 
		private int getUnusedPoolSize() {
			return MAXIMUM_POOL_SIZE - mCrawlingThreadPool.getActiveCount();
		}
 
		private boolean isShuttingDown() {
			return mCrawlingThreadPool.isShutdown()
					|| mCrawlingThreadPool.isTerminating();
		}
 
	}
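
One detail of ThreadPoolExecutor is worth keeping in mind with this configuration: the pool only creates threads beyond the core size when its work queue refuses a task, and an unbounded LinkedBlockingQueue never does. The sketch below, using a hypothetical PoolSizeSketch class, submits more blocking tasks than there are core threads and reports the largest number of threads the pool created.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class PoolSizeSketch {

    // Submits taskCount blocking tasks and returns how many threads the pool created.
    static int largestPoolSize(int core, int max, int taskCount) throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                core, max, 1, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < taskCount; i++) {
                pool.execute(() -> {
                    try {
                        release.await(); // keep the worker busy until released
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            return pool.getLargestPoolSize();
        } finally {
            release.countDown(); // let all tasks finish
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.SECONDS);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Same sizes as RunnableManager: core 5, maximum 8
        System.out.println(largestPoolSize(5, 8, 20)); // prints 5
    }
}
```

With the sizes used in RunnableManager, at most five crawling tasks run at the same time. Using a bounded queue, or setting the core size equal to the maximum size, are two ways to reach eight.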

To process URLs, we manage a queue of URLs to be crawled and a HashSet of crawled URLs to avoid re-crawling a page. A few more methods handle queueing runnable tasks and deleting and inserting database content. The complete code for WebCrawler.java is shown below.

package com.android.webcrawler;
 
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashSet;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
 
import org.apache.http.HttpStatus;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
 
import android.content.ContentValues;
import android.content.Context;
import android.database.sqlite.SQLiteDatabase;
import android.os.Handler;
import android.os.Looper;
import android.text.TextUtils;
 
public class WebCrawler {
 
	/**
	 * Interface for crawling callback
	 */
	interface CrawlingCallback {
		void onPageCrawlingCompleted();
 
		void onPageCrawlingFailed(String Url, int errorCode);
 
		void onCrawlingCompleted();
	}
 
	private Context mContext;
	// SQLiteOpenHelper object for handling crawling database
	private CrawlerDB mCrawlerDB;
	// Set containing already visited URls
	private HashSet<String> crawledURL;
	// Queue for unvisited URL
	BlockingQueue<String> uncrawledURL;
	// For parallel crawling execution using ThreadPoolExecuter
	RunnableManager mManager;
	// Callback interface object to notify UI
	CrawlingCallback callback;
	// For sync of crawled and yet to crawl url lists
	Object lock;
 
	public WebCrawler(Context ctx, CrawlingCallback callback) {
		this.mContext = ctx;
		this.callback = callback;
		mCrawlerDB = new CrawlerDB(mContext);
		crawledURL = new HashSet<>();
		uncrawledURL = new LinkedBlockingQueue<>();
		lock = new Object();
	}
 
	/**
	 * API to add crawler runnable in ThreadPoolExecutor workQueue
	 *
	 * @param Url
	 *            - Url to crawl
	 * @param isRootUrl
	 */
	public void startCrawlerTask(String Url, boolean isRootUrl) {
		// If it's root URl, we clear previous lists and DB table content
		if (isRootUrl) {
			crawledURL.clear();
			uncrawledURL.clear();
			clearDB();
			mManager = new RunnableManager();
		}
		// If the ThreadPoolExecutor is not shutting down, add the runnable to its work queue
		if (!mManager.isShuttingDown()) {
			CrawlerRunnable mTask = new CrawlerRunnable(callback, Url);
			mManager.addToCrawlingQueue(mTask);
		}
	}
 
	/**
	 * API to shutdown ThreadPoolExecuter
	 */
	public void stopCrawlerTasks() {
		mManager.cancelAllRunnable();
	}
 
	/**
	 * Runnable task which performs task of crawling and adding encountered URls
	 * to crawling list
	 *
	 * @author CLARION
	 *
	 */
	private class CrawlerRunnable implements Runnable {
 
		CrawlingCallback mCallback;
		String mUrl;
 
		public CrawlerRunnable(CrawlingCallback callback, String Url) {
			this.mCallback = callback;
			this.mUrl = Url;
		}
 
		@Override
		public void run() {
			String pageContent = retreiveHtmlContent(mUrl);
 
			if (!TextUtils.isEmpty(pageContent.toString())) {
				insertIntoCrawlerDB(mUrl, pageContent);
				synchronized (lock) {
					crawledURL.add(mUrl);
				}
				mCallback.onPageCrawlingCompleted();
			} else {
				mCallback.onPageCrawlingFailed(mUrl, -1);
			}
 
			if (!TextUtils.isEmpty(pageContent.toString())) {
				// START
				// JSoup Library used to filter urls from html body
				Document doc = Jsoup.parse(pageContent.toString());
				Elements links = doc.select("a[href]");
				for (Element link : links) {
					String extractedLink = link.attr("href");
					if (!TextUtils.isEmpty(extractedLink)) {
						synchronized (lock) {
							if (!crawledURL.contains(extractedLink))
								uncrawledURL.add(extractedLink);
						}
 
					}
				}
				// End JSoup
			}
			// Send msg to handler that crawling for this url is finished
			// start more crawling tasks if queue is not empty
			mHandler.sendEmptyMessage(0);
 
		}
 
		private String retreiveHtmlContent(String Url) {
			URL httpUrl = null;
			try {
				httpUrl = new URL(Url);
			} catch (MalformedURLException e) {
				e.printStackTrace();
			}
 
			int responseCode = HttpStatus.SC_OK;
			StringBuilder pageContent = new StringBuilder();
			try {
				if (httpUrl != null) {
					HttpURLConnection conn = (HttpURLConnection) httpUrl
							.openConnection();
					conn.setConnectTimeout(5000);
					conn.setReadTimeout(5000);
					responseCode = conn.getResponseCode();
					if (responseCode != HttpStatus.SC_OK) {
						throw new IllegalAccessException(
								" http connection failed");
					}
					BufferedReader br = new BufferedReader(
							new InputStreamReader(conn.getInputStream()));
					String line = null;
					while ((line = br.readLine()) != null) {
						pageContent.append(line);
					}
				}
 
			} catch (IOException e) {
				e.printStackTrace();
				mCallback.onPageCrawlingFailed(Url, -1);
			} catch (IllegalAccessException e) {
				e.printStackTrace();
				mCallback.onPageCrawlingFailed(Url, responseCode);
			}
 
			return pageContent.toString();
		}
 
	}
 
	/**
	 * API to clear previous content of crawler DB table
	 */
	public void clearDB() {
		try {
			SQLiteDatabase db = mCrawlerDB.getWritableDatabase();
			db.delete(CrawlerDB.TABLE_NAME, null, null);
		} catch (Exception e) {
			e.printStackTrace();
		}
	}
 
	/**
	 * API to insert crawled url info in database
	 *
	 * @param mUrl
	 *            - crawled url
	 * @param result
	 *            - html body content of url
	 */
	public void insertIntoCrawlerDB(String mUrl, String result) {
 
		if (TextUtils.isEmpty(result))
			return;
 
		SQLiteDatabase db = mCrawlerDB.getWritableDatabase();
		ContentValues values = new ContentValues();
		values.put(CrawlerDB.COLUMNS_NAME.CRAWLED_URL, mUrl);
		values.put(CrawlerDB.COLUMNS_NAME.CRAWLED_PAGE_CONTENT, result);
 
		db.insert(CrawlerDB.TABLE_NAME, null, values);
	}
 
	/**
	 * To manage Messages in a Thread
	 */
	private Handler mHandler = new Handler(Looper.getMainLooper()) {
		public void handleMessage(android.os.Message msg) {
 
			synchronized (lock) {
				if (uncrawledURL != null && uncrawledURL.size() > 0) {
					int availableTasks = mManager.getUnusedPoolSize();
					while (availableTasks > 0 && !uncrawledURL.isEmpty()) {
						startCrawlerTask(uncrawledURL.remove(), false);
						availableTasks--;
					}
				}
			}
 
		};
	};
 
	/**
	 * Helper class to interact with ThreadPoolExecutor for adding and removing
	 * runnable in workQueue
	 *
	 * @author CLARION
	 *
	 */
	private class RunnableManager {
 
		// Sets the amount of time an idle thread will wait for a task before
		// terminating
		private static final int KEEP_ALIVE_TIME = 1;
 
		// Sets the Time Unit to seconds
		private final TimeUnit KEEP_ALIVE_TIME_UNIT = TimeUnit.SECONDS;
 
		// Sets the initial threadpool size to 5
		private static final int CORE_POOL_SIZE = 5;
 
		// Sets the maximum threadpool size to 8
		private static final int MAXIMUM_POOL_SIZE = 8;
 
		// A queue of Runnables for crawling url
		private final BlockingQueue<Runnable> mCrawlingQueue;
 
		// A managed pool of background crawling threads
		private final ThreadPoolExecutor mCrawlingThreadPool;
 
		public RunnableManager() {
			mCrawlingQueue = new LinkedBlockingQueue<>();
			mCrawlingThreadPool = new ThreadPoolExecutor(CORE_POOL_SIZE,
					MAXIMUM_POOL_SIZE, KEEP_ALIVE_TIME, KEEP_ALIVE_TIME_UNIT,
					mCrawlingQueue);
		}
 
		private void addToCrawlingQueue(Runnable runnable) {
			mCrawlingThreadPool.execute(runnable);
		}
 
		private void cancelAllRunnable() {
			mCrawlingThreadPool.shutdownNow();
		}
 
		private int getUnusedPoolSize() {
			return MAXIMUM_POOL_SIZE - mCrawlingThreadPool.getActiveCount();
		}
 
		private boolean isShuttingDown() {
			return mCrawlingThreadPool.isShutdown()|| mCrawlingThreadPool.isTerminating();
		}
	}
}
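
The crawler queues href values exactly as they appear in the page, so a relative link such as /about would later fail in new URL(...). A possible pre-processing step is sketched below with java.net.URI. The LinkResolverSketch class and its resolve method are hypothetical, and dropping the fragment and skipping non-HTTP schemes are design assumptions for the example, not something the original code does.

```java
import java.net.URI;
import java.net.URISyntaxException;

class LinkResolverSketch {

    // Resolves href against pageUrl and strips any #fragment.
    // Assumes pageUrl has a path component (at least "/").
    // Returns null when the result is not an http(s) URL or cannot be parsed.
    static String resolve(String pageUrl, String href) {
        try {
            URI resolved = new URI(pageUrl).resolve(href.trim());
            String scheme = resolved.getScheme();
            if (scheme == null
                    || !(scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"))) {
                return null;
            }
            // Rebuild without the fragment so page#a and page#b count as one URL
            return new URI(resolved.getScheme(), resolved.getAuthority(),
                    resolved.getPath(), resolved.getQuery(), null).toString();
        } catch (URISyntaxException | IllegalArgumentException e) {
            return null; // malformed page URL or href
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("http://example.com/docs/index.html", "../about#team"));
        System.out.println(resolve("http://example.com/", "mailto:someone@example.com"));
    }
}
```

jsoup can also do the resolution itself: when the document is parsed with a base URI, link.absUrl("href") returns the absolute form of the link.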

Once we have finished crawling a page, we add the crawling info for this page to our record. Create a class CrawlerDB that extends SQLiteOpenHelper to manage the database. The final CrawlerDB.java file is shown below.

package com.android.webcrawler;
 
import android.content.Context;
import android.database.sqlite.SQLiteDatabase;
import android.database.sqlite.SQLiteOpenHelper;
 
/**
 * Helper class to manage crawler database creation and version management.
 * @author CLARION
 *
 */
public class CrawlerDB extends SQLiteOpenHelper{
 
	public static final String DATABSE_NAME = "crawler.db";
	public static final int DATABSE_VERSION = 1;
	public static final String TABLE_NAME = "CrawledURLs";
	private static final String TEXT_TYPE = " TEXT";
 
	interface COLUMNS_NAME{
		String _ID = "id";
		String CRAWLED_URL = "crawled_url";
		String CRAWLED_PAGE_CONTENT = "crawled_page_content";
	}
 
	public static final String SQL_CREATE_ENTRIES =
		    "CREATE TABLE " + TABLE_NAME + " (" +
		    COLUMNS_NAME._ID + " INTEGER PRIMARY KEY," +
		    COLUMNS_NAME.CRAWLED_URL + TEXT_TYPE+"," +
		    COLUMNS_NAME.CRAWLED_PAGE_CONTENT + TEXT_TYPE+
		    " )";
 
	public static final String SQL_DELETE_ENTRIES =
		    "DROP TABLE IF EXISTS " + TABLE_NAME;
 
	public CrawlerDB(Context context) {
		super(context, DATABSE_NAME, null, DATABSE_VERSION);
	}
 
	@Override
	public void onCreate(SQLiteDatabase db) {
		db.execSQL(SQL_CREATE_ENTRIES);
	}
 
	@Override
	public void onUpgrade(SQLiteDatabase db, int oldVersion, int newVersion) {
		db.execSQL(SQL_DELETE_ENTRIES);
		onCreate(db);
	}
}
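To see which statement onCreate() actually sends to SQLite, the concatenation in SQL_CREATE_ENTRIES can be evaluated in plain Java. The sketch below copies the string literals from CrawlerDB so that it runs without the Android SDK; the class name CreateStatementSketch and the method createSql() are made up for this illustration.

```java
// Hypothetical demo class; the literals are copied from CrawlerDB
class CreateStatementSketch {

    static final String TABLE_NAME = "CrawledURLs";
    static final String TEXT_TYPE = " TEXT";
    static final String ID = "id";
    static final String CRAWLED_URL = "crawled_url";
    static final String CRAWLED_PAGE_CONTENT = "crawled_page_content";

    // Same concatenation as CrawlerDB.SQL_CREATE_ENTRIES
    static String createSql() {
        return "CREATE TABLE " + TABLE_NAME + " (" +
                ID + " INTEGER PRIMARY KEY," +
                CRAWLED_URL + TEXT_TYPE + "," +
                CRAWLED_PAGE_CONTENT + TEXT_TYPE +
                " )";
    }

    public static void main(String[] args) {
        // prints "CREATE TABLE CrawledURLs (id INTEGER PRIMARY KEY,crawled_url TEXT,crawled_page_content TEXT )"
        System.out.println(createSql());
    }
}
```

Because id is declared INTEGER PRIMARY KEY, SQLite treats it as an alias for the row id and assigns a value automatically when an insert leaves it out, so only the URL and page content need to be supplied for each crawled page.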

Finally, let’s move on to MainActivity.java. It implements the OnClickListener interface for view click callbacks. After the start button is clicked, we start the crawling task with the help of a WebCrawler object, and CrawlingCallback is used to update the user on the crawled page count. Since crawling consumes both space and time, we stop it after one minute if the user does not stop it manually. Once the crawling task is finished, you can check LogCat for the crawled URLs, which are printed after querying the crawler database. The final code for MainActivity.java with this functionality is shown below.

package com.android.webcrawler;
 
import android.app.Activity;
import android.database.Cursor;
import android.database.sqlite.SQLiteDatabase;
import android.os.Bundle;
import android.os.Handler;
import android.text.TextUtils;
import android.util.Log;
import android.view.View;
import android.view.View.OnClickListener;
import android.widget.Button;
import android.widget.EditText;
import android.widget.LinearLayout;
import android.widget.TextView;
import android.widget.Toast;
 
public class MainActivity extends Activity implements OnClickListener {
 
	private LinearLayout crawlingInfo;
	private Button startButton;
	private EditText urlInputView;
	private TextView progressText;
 
	// WebCrawler object will be used to start crawling on root Url
	private WebCrawler crawler;
	// count variable for url crawled so far
	int crawledUrlCount;
	// state variable to check crawling status
	boolean crawlingRunning;
	// For sending message to Handler in order to stop crawling after 60000 ms
	private static final int MSG_STOP_CRAWLING = 111;
	private static final int CRAWLING_RUNNING_TIME = 60000;
 
	@Override
	protected void onCreate(Bundle savedInstanceState) {
		super.onCreate(savedInstanceState);
		setContentView(R.layout.activity_main);
 
		crawlingInfo = (LinearLayout) findViewById(R.id.crawlingInfo);
		startButton = (Button) findViewById(R.id.start);
		urlInputView = (EditText) findViewById(R.id.webUrl);
		progressText = (TextView) findViewById(R.id.progressText);
 
		crawler = new WebCrawler(this, mCallback);
	}
 
	/**
	 * callback for crawling events
	 */
	private WebCrawler.CrawlingCallback mCallback = new WebCrawler.CrawlingCallback() {
 
		@Override
		public void onPageCrawlingCompleted() {
			crawledUrlCount++;
			progressText.post(new Runnable() {
 
				@Override
				public void run() {
					progressText.setText(crawledUrlCount
							+ " pages crawled so far!!");
 
				}
			});
		}
 
		@Override
		public void onPageCrawlingFailed(String Url, int errorCode) {
			// TODO Auto-generated method stub
		}
 
		@Override
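		public void onCrawlingCompleted() {
			// This callback may arrive on a crawler thread, so route the stop
			// through the handler instead of touching views directly
			handler.removeMessages(MSG_STOP_CRAWLING);
			handler.sendEmptyMessage(MSG_STOP_CRAWLING);
		}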
	};
 
	/**
	 * Callback for handling button onclick events
	 */
	@Override
	public void onClick(View v) {
		int viewId = v.getId();
		switch (viewId) {
		case R.id.start:
			String webUrl = urlInputView.getText().toString();
			if (TextUtils.isEmpty(webUrl)) {
				Toast.makeText(getApplicationContext(), "Please input web Url",
						Toast.LENGTH_SHORT).show();
			} else {
				crawlingRunning = true;
				crawler.startCrawlerTask(webUrl, true);
				startButton.setEnabled(false);
				crawlingInfo.setVisibility(View.VISIBLE);
				// Send delayed message to handler for stopping crawling
				handler.sendEmptyMessageDelayed(MSG_STOP_CRAWLING,
						CRAWLING_RUNNING_TIME);
			}
			break;
		case R.id.stop:
			// remove any scheduled messages if user stopped crawling by
			// clicking stop button
			handler.removeMessages(MSG_STOP_CRAWLING);
			stopCrawling();
			break;
		}
	}
 
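	// Handler bound to the main thread so that stopCrawling() can update views
	private Handler handler = new Handler(android.os.Looper.getMainLooper()) {
		@Override
		public void handleMessage(android.os.Message msg) {
			if (msg.what == MSG_STOP_CRAWLING) {
				stopCrawling();
			}
		}
	};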
 
	/**
	 * API to handle post crawling events
	 */
	private void stopCrawling() {
		if (crawlingRunning) {
			crawler.stopCrawlerTasks();
			crawlingInfo.setVisibility(View.INVISIBLE);
			startButton.setEnabled(true);
			startButton.setVisibility(View.VISIBLE);
			crawlingRunning = false;
			if (crawledUrlCount > 0) {
				Toast.makeText(getApplicationContext(),
						printCrawledEntriesFromDb() + " pages crawled",
						Toast.LENGTH_SHORT).show();
			}
 
			crawledUrlCount = 0;
			progressText.setText("");
		}
	}
 
	/**
	 * API to output crawled urls in logcat
	 *
	 * @return number of rows saved in crawling database
	 */
	protected int printCrawledEntriesFromDb() {

		int count = 0;
		CrawlerDB mCrawlerDB = new CrawlerDB(this);
		SQLiteDatabase db = mCrawlerDB.getReadableDatabase();

		Cursor mCursor = db.query(CrawlerDB.TABLE_NAME, null, null, null, null,
				null, null);
		if (mCursor != null) {
			try {
				count = mCursor.getCount();
				int columnIndex = mCursor
						.getColumnIndex(CrawlerDB.COLUMNS_NAME.CRAWLED_URL);
				// moveToNext() returns false once the last row has been read
				while (mCursor.moveToNext()) {
					Log.d("AndroidSRC_Crawler",
							"Crawled Url " + mCursor.getString(columnIndex));
				}
			} finally {
				// release the cursor once all rows have been logged
				mCursor.close();
			}
		}

		return count;
	}
}
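The stop logic above combines two paths: an automatic stop after a timeout and a manual stop that cancels the pending timeout. The same pattern can be sketched in plain Java with a ScheduledExecutorService standing in for the Android Handler, which makes it easy to try outside an emulator. This is an analogue under that assumption, not the code the application uses; StopTimerSketch and all of its member names are hypothetical.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical plain-Java analogue of the Handler based stop logic
class StopTimerSketch {

    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();
    private final AtomicInteger stopCalls = new AtomicInteger();
    private ScheduledFuture<?> pendingStop;
    private boolean running;

    // Like the start branch: mark as running and schedule the automatic stop
    synchronized void start(long timeoutMillis) {
        running = true;
        pendingStop = timer.schedule(new Runnable() {
            @Override
            public void run() {
                stop();
            }
        }, timeoutMillis, TimeUnit.MILLISECONDS);
    }

    // Like the stop branch: cancel the pending timeout, then stop
    synchronized void stopManually() {
        if (pendingStop != null) {
            pendingStop.cancel(false);
        }
        stop();
    }

    // Guarded by the running flag so each run is stopped only once
    private synchronized void stop() {
        if (running) {
            running = false;
            stopCalls.incrementAndGet();
        }
    }

    int stopCount() {
        return stopCalls.get();
    }

    void shutdown() {
        timer.shutdownNow();
    }

    public static void main(String[] args) {
        StopTimerSketch sketch = new StopTimerSketch();
        sketch.start(60000);
        sketch.stopManually();
        // prints "stops: 1"
        System.out.println("stops: " + sketch.stopCount());
        sketch.shutdown();
    }
}
```

The running flag plays the same role as crawlingRunning in MainActivity: whichever path fires first performs the stop, and the other becomes a no-op.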

5. Build and Run Application

Now build and run your application. Enter the URL of the website you want to crawl. Once crawling finishes, you can check the visited URLs in your LogCat output. The information saved in the crawler database can then be used as required.
