Android Web Crawler Example: Multithreaded Implementation
A web crawler automates the task of indexing website pages. Crawling is also referred to as web spidering and is used extensively by search engines to provide effective and up-to-date results for user search queries.
Typical uses of a web crawler include, but are not limited to:
- Page Indexing for Search Engines
- Page Content Scraping
- Dictionary Words Processing
- Syntax and Structure Validation
Given a root URL, the web crawler crawls the content of the current page and adds the URLs it extracts to a processing queue of uncrawled URLs. Once a page is crawled, its data is stored in a database for later processing as per requirement. This task is time consuming if hyperlinks are crawled sequentially, so we will create an Android web crawler example application that performs the crawling task in parallel. An SQLite database will be used for saving the records of crawled URLs. A minimal sequential sketch of the crawl loop is shown below; the rest of the article parallelizes it with a thread pool.
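The following plain-Java sketch shows the single-threaded version of the loop just described. The helper names fetchPage and extractLinks are hypothetical placeholders, not part of the tutorial code; later sections implement them with HttpURLConnection and jsoup and replace the sequential loop with a thread pool.

import java.util.HashSet;
import java.util.LinkedList;
import java.util.Queue;
import java.util.Set;

// Minimal sequential sketch of the crawl loop described above.
public class SequentialCrawlerSketch {

    static Queue<String> uncrawled = new LinkedList<>();
    static Set<String> crawled = new HashSet<>();

    public static void crawl(String rootUrl, int maxPages) {
        uncrawled.add(rootUrl);
        while (!uncrawled.isEmpty() && crawled.size() < maxPages) {
            String url = uncrawled.remove();
            if (crawled.contains(url)) {
                continue; // already visited
            }
            String html = fetchPage(url); // download page content
            crawled.add(url);             // mark as visited (the app also saves it to SQLite)
            for (String link : extractLinks(html)) {
                if (!crawled.contains(link)) {
                    uncrawled.add(link);  // queue newly discovered URLs
                }
            }
        }
    }

    // Placeholders so the sketch compiles; the tutorial implements these with
    // HttpURLConnection and jsoup respectively.
    static String fetchPage(String url) { return ""; }
    static Set<String> extractLinks(String html) { return new HashSet<>(); }
}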
Android Web Crawler Example Application
1. Create an Android Application
2. Preparing Application Manifest File
Add the INTERNET permission, since crawling requires fetching URLs with HttpURLConnection. The final AndroidManifest.xml will be as below.
<?xml version="1.0" encoding="utf-8"?>
<manifest xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.android.webcrawler"
    android:versionCode="1"
    android:versionName="1.0" >

    <uses-sdk
        android:minSdkVersion="14"
        android:targetSdkVersion="21" />

    <uses-permission android:name="android.permission.INTERNET" />

    <application
        android:allowBackup="true"
        android:icon="@drawable/ic_launcher"
        android:label="@string/app_name"
        android:theme="@style/AppTheme" >
        <activity
            android:name=".MainActivity"
            android:label="@string/app_name" >
            <intent-filter>
                <action android:name="android.intent.action.MAIN" />
                <category android:name="android.intent.category.LAUNCHER" />
            </intent-filter>
        </activity>
    </application>

</manifest>
3. Preparing Layout Files
Our layout comprises a Button to start and stop crawling, an EditText that takes the user's input for the URL to be crawled, a ProgressBar shown while crawling is running, and a TextView that updates the user with the crawled page count. The final layout /res/layout/activity_main.xml will be as below.
<LinearLayout xmlns:android="http://schemas.android.com/apk/res/android"
    xmlns:tools="http://schemas.android.com/tools"
    android:layout_width="match_parent"
    android:layout_height="match_parent"
    android:orientation="vertical"
    android:padding="20dp"
    tools:context="com.android.webcrawler.MainActivity">

    <EditText
        android:id="@+id/webUrl"
        android:layout_width="fill_parent"
        android:layout_height="wrap_content"
        android:layout_marginTop="20dp"
        android:hint="Enter URL"/>

    <Button
        android:id="@+id/start"
        android:layout_width="fill_parent"
        android:layout_height="wrap_content"
        android:layout_marginTop="20dp"
        android:onClick="onClick"
        android:text="Start Crawling"/>

    <LinearLayout
        android:id="@+id/crawlingInfo"
        android:layout_width="fill_parent"
        android:layout_height="wrap_content"
        android:orientation="vertical"
        android:visibility="invisible">

        <ProgressBar
            android:id="@+id/progressBar"
            style="?android:attr/progressBarStyleLarge"
            android:layout_width="wrap_content"
            android:layout_height="wrap_content"
            android:layout_gravity="center_horizontal"
            android:layout_marginTop="20dp"/>

        <TextView
            android:id="@+id/progressText"
            style="?android:attr/textAppearanceLarge"
            android:layout_width="fill_parent"
            android:layout_height="wrap_content"
            android:gravity="center_horizontal"/>

        <Button
            android:id="@+id/stop"
            android:layout_width="fill_parent"
            android:layout_height="wrap_content"
            android:layout_marginTop="20dp"
            android:onClick="onClick"
            android:text="Stop Crawling"/>
    </LinearLayout>

</LinearLayout>
4. Web Crawler Implementation
Create a class WebCrawler which defines the routines and objects needed to crawl a page. Define an interface CrawlingCallback that provides callbacks for crawling events: page crawling completed, page crawling failed, and all crawling completed.
/**
 * Interface for crawling callback
 */
interface CrawlingCallback {
    void onPageCrawlingCompleted();

    void onPageCrawlingFailed(String Url, int errorCode);

    void onCrawlingCompleted();
}
Define a private class CrawlerRunnable which implements Runnable. The constructor of CrawlerRunnable takes a CrawlingCallback and the URL to be crawled as parameters. Inside this class, we have an API to download the HTML content body for a given URL. Once we have the raw HTML content, we need to process it to extract hyperlinks. For extraction of URLs, I have used jsoup, which is a very good HTML parser library. Download jsoup library version 1.8.1 and add it to the libs folder of your Android project (a Gradle alternative is shown after the code). The final code for CrawlerRunnable will be as below.
/**
 * Runnable task which performs the task of crawling and adding encountered
 * URLs to the crawling list
 *
 * @author CLARION
 *
 */
private class CrawlerRunnable implements Runnable {

    CrawlingCallback mCallback;
    String mUrl;

    public CrawlerRunnable(CrawlingCallback callback, String Url) {
        this.mCallback = callback;
        this.mUrl = Url;
    }

    @Override
    public void run() {
        String pageContent = retreiveHtmlContent(mUrl);

        if (!TextUtils.isEmpty(pageContent)) {
            insertIntoCrawlerDB(mUrl, pageContent);
            synchronized (lock) {
                crawledURL.add(mUrl);
            }
            mCallback.onPageCrawlingCompleted();
        } else {
            mCallback.onPageCrawlingFailed(mUrl, -1);
        }

        if (!TextUtils.isEmpty(pageContent)) {
            // START
            // jsoup library used to filter urls from html body
            Document doc = Jsoup.parse(pageContent);
            Elements links = doc.select("a[href]");
            for (Element link : links) {
                String extractedLink = link.attr("href");
                if (!TextUtils.isEmpty(extractedLink)) {
                    synchronized (lock) {
                        if (!crawledURL.contains(extractedLink))
                            uncrawledURL.add(extractedLink);
                    }
                }
            }
            // End jsoup
        }
        // Send msg to handler that crawling for this url is finished,
        // start more crawling tasks if queue is not empty
        mHandler.sendEmptyMessage(0);
    }

    private String retreiveHtmlContent(String Url) {
        URL httpUrl = null;
        try {
            httpUrl = new URL(Url);
        } catch (MalformedURLException e) {
            e.printStackTrace();
        }

        int responseCode = HttpStatus.SC_OK;
        StringBuilder pageContent = new StringBuilder();
        try {
            if (httpUrl != null) {
                HttpURLConnection conn = (HttpURLConnection) httpUrl
                        .openConnection();
                conn.setConnectTimeout(5000);
                conn.setReadTimeout(5000);
                responseCode = conn.getResponseCode();
                if (responseCode != HttpStatus.SC_OK) {
                    throw new IllegalAccessException(" http connection failed");
                }
                BufferedReader br = new BufferedReader(
                        new InputStreamReader(conn.getInputStream()));
                String line = null;
                while ((line = br.readLine()) != null) {
                    pageContent.append(line);
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
            mCallback.onPageCrawlingFailed(Url, -1);
        } catch (IllegalAccessException e) {
            e.printStackTrace();
            mCallback.onPageCrawlingFailed(Url, responseCode);
        }

        return pageContent.toString();
    }
}
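If your project is built with Gradle rather than an Eclipse-style libs folder, you can pull jsoup in as a dependency instead. This is an assumption about your build setup, not part of the original project; jsoup's Maven coordinates are org.jsoup:jsoup.

dependencies {
    // Gradle alternative to dropping jsoup-1.8.1.jar into libs/
    // (assumes mavenCentral() is configured; older Android Gradle plugins use 'compile' instead)
    implementation 'org.jsoup:jsoup:1.8.1'
}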
For processing crawling tasks in parallel, we will use ThreadPoolExecutor, which manages a work queue of runnable tasks and executes them on its pool of threads. You can define the starting and maximum size of the thread pool. For managing CrawlerRunnable tasks, create a new private class RunnableManager which defines the required APIs for adding tasks to the pool and cancelling them.
/**
 * Helper class to interact with ThreadPoolExecutor for adding and removing
 * runnable in workQueue
 *
 * @author CLARION
 *
 */
private class RunnableManager {

    // Sets the amount of time an idle thread will wait for a task before
    // terminating
    private static final int KEEP_ALIVE_TIME = 1;

    // Sets the Time Unit to seconds
    private final TimeUnit KEEP_ALIVE_TIME_UNIT = TimeUnit.SECONDS;

    // Sets the initial threadpool size to 5
    private static final int CORE_POOL_SIZE = 5;

    // Sets the maximum threadpool size to 8
    private static final int MAXIMUM_POOL_SIZE = 8;

    // A queue of Runnables for crawling url
    private final BlockingQueue<Runnable> mCrawlingQueue;

    // A managed pool of background crawling threads
    private final ThreadPoolExecutor mCrawlingThreadPool;

    public RunnableManager() {
        mCrawlingQueue = new LinkedBlockingQueue<>();
        mCrawlingThreadPool = new ThreadPoolExecutor(CORE_POOL_SIZE,
                MAXIMUM_POOL_SIZE, KEEP_ALIVE_TIME, KEEP_ALIVE_TIME_UNIT,
                mCrawlingQueue);
    }

    private void addToCrawlingQueue(Runnable runnable) {
        mCrawlingThreadPool.execute(runnable);
    }

    private void cancelAllRunnable() {
        mCrawlingThreadPool.shutdownNow();
    }

    private int getUnusedPoolSize() {
        return MAXIMUM_POOL_SIZE - mCrawlingThreadPool.getActiveCount();
    }

    private boolean isShuttingDown() {
        return mCrawlingThreadPool.isShutdown()
                || mCrawlingThreadPool.isTerminating();
    }
}
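One design detail worth noting: because RunnableManager backs the executor with an unbounded LinkedBlockingQueue, ThreadPoolExecutor never grows beyond CORE_POOL_SIZE; MAXIMUM_POOL_SIZE only takes effect when a bounded work queue fills up. In this app that simply means at most five pages are downloaded concurrently and any extra submitted tasks wait in the queue. The standalone sketch below (plain Java, not part of the app) illustrates this behaviour:

import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolSizeDemo {
    public static void main(String[] args) throws InterruptedException {
        // Same parameters as RunnableManager: core 5, max 8, unbounded queue
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                5, 8, 1, TimeUnit.SECONDS, new LinkedBlockingQueue<Runnable>());

        // Submit more tasks than the core size; extra tasks wait in the queue.
        for (int i = 0; i < 20; i++) {
            pool.execute(new Runnable() {
                @Override
                public void run() {
                    try {
                        Thread.sleep(500);
                    } catch (InterruptedException ignored) {
                    }
                }
            });
        }

        Thread.sleep(200);
        // Prints roughly: poolSize=5, queued=15 (never grows to 8)
        System.out.println("poolSize=" + pool.getPoolSize()
                + ", queued=" + pool.getQueue().size());
        pool.shutdown();
    }
}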
For processing URLs, we will manage a queue of URLs to be crawled and a HashSet of crawled URLs to avoid re-crawling a page, along with a few methods for queueing runnable tasks and for database content deletion and insertion. The final complete code for WebCrawler.java will be as below.
package com.android.webcrawler;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashSet;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

import org.apache.http.HttpStatus;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import android.content.ContentValues;
import android.content.Context;
import android.database.sqlite.SQLiteDatabase;
import android.os.Handler;
import android.os.Looper;
import android.text.TextUtils;

public class WebCrawler {

    /**
     * Interface for crawling callback
     */
    interface CrawlingCallback {
        void onPageCrawlingCompleted();

        void onPageCrawlingFailed(String Url, int errorCode);

        void onCrawlingCompleted();
    }

    private Context mContext;
    // SQLiteOpenHelper object for handling crawling database
    private CrawlerDB mCrawlerDB;
    // Set containing already visited URLs
    private HashSet<String> crawledURL;
    // Queue for unvisited URLs
    BlockingQueue<String> uncrawledURL;
    // For parallel crawling execution using ThreadPoolExecutor
    RunnableManager mManager;
    // Callback interface object to notify UI
    CrawlingCallback callback;
    // For sync of crawled and yet to crawl url lists
    Object lock;

    public WebCrawler(Context ctx, CrawlingCallback callback) {
        this.mContext = ctx;
        this.callback = callback;
        mCrawlerDB = new CrawlerDB(mContext);
        crawledURL = new HashSet<>();
        uncrawledURL = new LinkedBlockingQueue<>();
        lock = new Object();
    }

    /**
     * API to add crawler runnable in ThreadPoolExecutor workQueue
     *
     * @param Url
     *            - Url to crawl
     * @param isRootUrl
     */
    public void startCrawlerTask(String Url, boolean isRootUrl) {
        // If it's the root URL, we clear previous lists and DB table content
        if (isRootUrl) {
            crawledURL.clear();
            uncrawledURL.clear();
            clearDB();
            mManager = new RunnableManager();
        }
        // If ThreadPoolExecutor is not shutting down, add runnable to workQueue
        if (!mManager.isShuttingDown()) {
            CrawlerRunnable mTask = new CrawlerRunnable(callback, Url);
            mManager.addToCrawlingQueue(mTask);
        }
    }

    /**
     * API to shutdown ThreadPoolExecutor
     */
    public void stopCrawlerTasks() {
        mManager.cancelAllRunnable();
    }

    /**
     * Runnable task which performs the task of crawling and adding encountered
     * URLs to the crawling list
     *
     * @author CLARION
     *
     */
    private class CrawlerRunnable implements Runnable {

        CrawlingCallback mCallback;
        String mUrl;

        public CrawlerRunnable(CrawlingCallback callback, String Url) {
            this.mCallback = callback;
            this.mUrl = Url;
        }

        @Override
        public void run() {
            String pageContent = retreiveHtmlContent(mUrl);

            if (!TextUtils.isEmpty(pageContent)) {
                insertIntoCrawlerDB(mUrl, pageContent);
                synchronized (lock) {
                    crawledURL.add(mUrl);
                }
                mCallback.onPageCrawlingCompleted();
            } else {
                mCallback.onPageCrawlingFailed(mUrl, -1);
            }

            if (!TextUtils.isEmpty(pageContent)) {
                // START
                // jsoup library used to filter urls from html body
                Document doc = Jsoup.parse(pageContent);
                Elements links = doc.select("a[href]");
                for (Element link : links) {
                    String extractedLink = link.attr("href");
                    if (!TextUtils.isEmpty(extractedLink)) {
                        synchronized (lock) {
                            if (!crawledURL.contains(extractedLink))
                                uncrawledURL.add(extractedLink);
                        }
                    }
                }
                // End jsoup
            }
            // Send msg to handler that crawling for this url is finished,
            // start more crawling tasks if queue is not empty
            mHandler.sendEmptyMessage(0);
        }

        private String retreiveHtmlContent(String Url) {
            URL httpUrl = null;
            try {
                httpUrl = new URL(Url);
            } catch (MalformedURLException e) {
                e.printStackTrace();
            }

            int responseCode = HttpStatus.SC_OK;
            StringBuilder pageContent = new StringBuilder();
            try {
                if (httpUrl != null) {
                    HttpURLConnection conn = (HttpURLConnection) httpUrl
                            .openConnection();
                    conn.setConnectTimeout(5000);
                    conn.setReadTimeout(5000);
                    responseCode = conn.getResponseCode();
                    if (responseCode != HttpStatus.SC_OK) {
                        throw new IllegalAccessException(
                                " http connection failed");
                    }
                    BufferedReader br = new BufferedReader(
                            new InputStreamReader(conn.getInputStream()));
                    String line = null;
                    while ((line = br.readLine()) != null) {
                        pageContent.append(line);
                    }
                }
            } catch (IOException e) {
                e.printStackTrace();
                mCallback.onPageCrawlingFailed(Url, -1);
            } catch (IllegalAccessException e) {
                e.printStackTrace();
                mCallback.onPageCrawlingFailed(Url, responseCode);
            }

            return pageContent.toString();
        }
    }

    /**
     * API to clear previous content of crawler DB table
     */
    public void clearDB() {
        try {
            SQLiteDatabase db = mCrawlerDB.getWritableDatabase();
            db.delete(CrawlerDB.TABLE_NAME, null, null);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     * API to insert crawled url info in database
     *
     * @param mUrl
     *            - crawled url
     * @param result
     *            - html body content of url
     */
    public void insertIntoCrawlerDB(String mUrl, String result) {

        if (TextUtils.isEmpty(result))
            return;

        SQLiteDatabase db = mCrawlerDB.getWritableDatabase();
        ContentValues values = new ContentValues();
        values.put(CrawlerDB.COLUMNS_NAME.CRAWLED_URL, mUrl);
        values.put(CrawlerDB.COLUMNS_NAME.CRAWLED_PAGE_CONTENT, result);

        db.insert(CrawlerDB.TABLE_NAME, null, values);
    }

    /**
     * To manage Messages in a Thread
     */
    private Handler mHandler = new Handler(Looper.getMainLooper()) {
        public void handleMessage(android.os.Message msg) {
            synchronized (lock) {
                if (uncrawledURL != null && uncrawledURL.size() > 0) {
                    int availableTasks = mManager.getUnusedPoolSize();
                    while (availableTasks > 0 && !uncrawledURL.isEmpty()) {
                        startCrawlerTask(uncrawledURL.remove(), false);
                        availableTasks--;
                    }
                }
            }
        };
    };

    /**
     * Helper class to interact with ThreadPoolExecutor for adding and removing
     * runnable in workQueue
     *
     * @author CLARION
     *
     */
    private class RunnableManager {

        // Sets the amount of time an idle thread will wait for a task before
        // terminating
        private static final int KEEP_ALIVE_TIME = 1;

        // Sets the Time Unit to seconds
        private final TimeUnit KEEP_ALIVE_TIME_UNIT = TimeUnit.SECONDS;

        // Sets the initial threadpool size to 5
        private static final int CORE_POOL_SIZE = 5;

        // Sets the maximum threadpool size to 8
        private static final int MAXIMUM_POOL_SIZE = 8;

        // A queue of Runnables for crawling url
        private final BlockingQueue<Runnable> mCrawlingQueue;

        // A managed pool of background crawling threads
        private final ThreadPoolExecutor mCrawlingThreadPool;

        public RunnableManager() {
            mCrawlingQueue = new LinkedBlockingQueue<>();
            mCrawlingThreadPool = new ThreadPoolExecutor(CORE_POOL_SIZE,
                    MAXIMUM_POOL_SIZE, KEEP_ALIVE_TIME, KEEP_ALIVE_TIME_UNIT,
                    mCrawlingQueue);
        }

        private void addToCrawlingQueue(Runnable runnable) {
            mCrawlingThreadPool.execute(runnable);
        }

        private void cancelAllRunnable() {
            mCrawlingThreadPool.shutdownNow();
        }

        private int getUnusedPoolSize() {
            return MAXIMUM_POOL_SIZE - mCrawlingThreadPool.getActiveCount();
        }

        private boolean isShuttingDown() {
            return mCrawlingThreadPool.isShutdown()
                    || mCrawlingThreadPool.isTerminating();
        }
    }
}
Once we have finished crawling a page, we add the crawling info for that page to our records. Create a class CrawlerDB which extends SQLiteOpenHelper to manage the database. The final CrawlerDB.java file will be as below.
package com.android.webcrawler;

import android.content.Context;
import android.database.sqlite.SQLiteDatabase;
import android.database.sqlite.SQLiteOpenHelper;

/**
 * Helper class to manage crawler database creation and version management.
 *
 * @author CLARION
 *
 */
public class CrawlerDB extends SQLiteOpenHelper {

    public static final String DATABSE_NAME = "crawler.db";
    public static final int DATABSE_VERSION = 1;
    public static final String TABLE_NAME = "CrawledURLs";
    private static final String TEXT_TYPE = " TEXT";

    interface COLUMNS_NAME {
        String _ID = "id";
        String CRAWLED_URL = "crawled_url";
        String CRAWLED_PAGE_CONTENT = "crawled_page_content";
    }

    public static final String SQL_CREATE_ENTRIES = "CREATE TABLE "
            + TABLE_NAME + " (" + COLUMNS_NAME._ID + " INTEGER PRIMARY KEY,"
            + COLUMNS_NAME.CRAWLED_URL + TEXT_TYPE + ","
            + COLUMNS_NAME.CRAWLED_PAGE_CONTENT + TEXT_TYPE + " )";

    public static final String SQL_DELETE_ENTRIES = "DROP TABLE IF EXISTS "
            + TABLE_NAME;

    public CrawlerDB(Context context) {
        super(context, DATABSE_NAME, null, DATABSE_VERSION);
    }

    @Override
    public void onCreate(SQLiteDatabase db) {
        db.execSQL(SQL_CREATE_ENTRIES);
    }

    @Override
    public void onUpgrade(SQLiteDatabase db, int oldVersion, int newVersion) {
        db.execSQL(SQL_DELETE_ENTRIES);
        onCreate(db);
    }
}
Finally, let's move on to MainActivity.java. It implements the OnClickListener interface for view click callbacks. When the start button is clicked, we start the crawling task with the help of a WebCrawler object, and CrawlingCallback is used to update the user with the crawled page count. Since crawling is space and time consuming, we will stop it after one minute if the user does not stop it manually. Once the crawling task is finished, you can check the output in LogCat, where the crawled URLs are printed after querying the crawling database. The final code for MainActivity.java after adding these functionalities will be as below.
package com.android.webcrawler;

import android.app.Activity;
import android.database.Cursor;
import android.database.sqlite.SQLiteDatabase;
import android.os.Bundle;
import android.os.Handler;
import android.text.TextUtils;
import android.util.Log;
import android.view.View;
import android.view.View.OnClickListener;
import android.widget.Button;
import android.widget.EditText;
import android.widget.LinearLayout;
import android.widget.TextView;
import android.widget.Toast;

public class MainActivity extends Activity implements OnClickListener {

    private LinearLayout crawlingInfo;
    private Button startButton;
    private EditText urlInputView;
    private TextView progressText;

    // WebCrawler object will be used to start crawling on root Url
    private WebCrawler crawler;
    // count variable for url crawled so far
    int crawledUrlCount;
    // state variable to check crawling status
    boolean crawlingRunning;

    // For sending message to Handler in order to stop crawling after 60000 ms
    private static final int MSG_STOP_CRAWLING = 111;
    private static final int CRAWLING_RUNNING_TIME = 60000;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);
        crawlingInfo = (LinearLayout) findViewById(R.id.crawlingInfo);
        startButton = (Button) findViewById(R.id.start);
        urlInputView = (EditText) findViewById(R.id.webUrl);
        progressText = (TextView) findViewById(R.id.progressText);
        crawler = new WebCrawler(this, mCallback);
    }

    /**
     * callback for crawling events
     */
    private WebCrawler.CrawlingCallback mCallback = new WebCrawler.CrawlingCallback() {

        @Override
        public void onPageCrawlingCompleted() {
            crawledUrlCount++;
            progressText.post(new Runnable() {

                @Override
                public void run() {
                    progressText.setText(crawledUrlCount
                            + " pages crawled so far!!");
                }
            });
        }

        @Override
        public void onPageCrawlingFailed(String Url, int errorCode) {
            // TODO Auto-generated method stub
        }

        @Override
        public void onCrawlingCompleted() {
            stopCrawling();
        }
    };

    /**
     * Callback for handling button onclick events
     */
    @Override
    public void onClick(View v) {
        int viewId = v.getId();
        switch (viewId) {
        case R.id.start:
            String webUrl = urlInputView.getText().toString();
            if (TextUtils.isEmpty(webUrl)) {
                Toast.makeText(getApplicationContext(), "Please input web Url",
                        Toast.LENGTH_SHORT).show();
            } else {
                crawlingRunning = true;
                crawler.startCrawlerTask(webUrl, true);
                startButton.setEnabled(false);
                crawlingInfo.setVisibility(View.VISIBLE);
                // Send delayed message to handler for stopping crawling
                handler.sendEmptyMessageDelayed(MSG_STOP_CRAWLING,
                        CRAWLING_RUNNING_TIME);
            }
            break;
        case R.id.stop:
            // remove any scheduled messages if user stopped crawling by
            // clicking stop button
            handler.removeMessages(MSG_STOP_CRAWLING);
            stopCrawling();
            break;
        }
    }

    private Handler handler = new Handler() {
        public void handleMessage(android.os.Message msg) {
            stopCrawling();
        };
    };

    /**
     * API to handle post crawling events
     */
    private void stopCrawling() {
        if (crawlingRunning) {
            crawler.stopCrawlerTasks();
            crawlingInfo.setVisibility(View.INVISIBLE);
            startButton.setEnabled(true);
            startButton.setVisibility(View.VISIBLE);
            crawlingRunning = false;
            if (crawledUrlCount > 0)
                Toast.makeText(getApplicationContext(),
                        printCrawledEntriesFromDb() + " pages crawled",
                        Toast.LENGTH_SHORT).show();

            crawledUrlCount = 0;
            progressText.setText("");
        }
    }

    /**
     * API to output crawled urls in logcat
     *
     * @return number of rows saved in crawling database
     */
    protected int printCrawledEntriesFromDb() {
        int count = 0;
        CrawlerDB mCrawlerDB = new CrawlerDB(this);
        SQLiteDatabase db = mCrawlerDB.getReadableDatabase();
        Cursor mCursor = db.query(CrawlerDB.TABLE_NAME, null, null, null,
                null, null, null);
        if (mCursor != null && mCursor.getCount() > 0) {
            count = mCursor.getCount();
            mCursor.moveToFirst();
            int columnIndex = mCursor
                    .getColumnIndex(CrawlerDB.COLUMNS_NAME.CRAWLED_URL);
            for (int i = 0; i < count; i++) {
                Log.d("AndroidSRC_Crawler",
                        "Crawled Url " + mCursor.getString(columnIndex));
                mCursor.moveToNext();
            }
        }
        return count;
    }
}
5. Build and Run Application
Now, try building and running your application. Input the URL of the website you want to crawl; when crawling finishes, you can check the visited URLs in your LogCat output. The information saved in the crawler database can then be used as per requirement.
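To see only the crawler's output from the command line, you can filter LogCat on the log tag used in printCrawledEntriesFromDb (assuming adb is available on your PATH and a single device or emulator is connected):

adb logcat -s AndroidSRC_Crawler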