Wednesday, February 17, 2021

Mowglee, Multi-Threaded Geo Web Crawler In Java

[Updates to the Article and Codebase / Code Snippets ~ 17/Feb/2021]
- Fixed Possible Connection Leaks in Network Code
- Fixed Poor Code and Bad Programming Practices
- Improved Code Formatting, Practiced Clean Code*
- Mowglee v0.02a is Released (Previously, v0.01a)


This article describes the implementation of a web crawling system called Mowglee that uses geography as the main classifying criterion for crawling. It runs in multi-threaded mode and provides a default implementation of the robots exclusion protocol, sitemap generation, data classifiers, data analyzers, and a general framework for building applications on top of a web crawler. The implementation is in core Java.

To follow along, you should have intermediate to expert-level core Java skills, an understanding of the intricacies of multi-threading in Java, and an understanding of the real-world usage of and need for web crawlers.


What You Will Learn

  • How to write simple and distributed node-based web crawlers in core Java.

  • How to design a web crawler for geographic affinity.

  • How to write multi-threaded or asynchronous task executor-based crawlers.

  • How to write web crawlers that have modular or pluggable architecture.

  • How to analyze multiple types of structured or unstructured data. (Covered Minimally)

Geo Crawling

The crawling system described here works to maximize geographic penetration in terms of higher reachability. It uses the most important or highest-throughput hyperlinks of a specific geography as the starting points, or crawl homes. Throughput here refers to the number of varied links, text, or media with higher data relevance for a given geography. The crawler is developed using concepts of asynchronous task execution and multi-threading in Java.
The crawler is called Mowglee. It is better to keep the "user-agent" name the same, but you can add your own variant. For example, if you want to add the word "raremile" to identify your variant, rename the user-agent to "mowglee-raremile". Make sure you have JDK 1.6+ installed on your system before you try to run it. Some of these classes have their own main() method, but those were written only for unit testing.
The main class to run the application is in.co.mowglee.crawl.core.Mowglee. You can also run the bundled JAR file under dist using java -jar mowglee.jar. If you are using JDK 6 for execution (recommended), you can use jvisualvm for profiling.
In the figure below, the class Mowglee is shown as MowgleeCentral.
 

Figure 1: Mowglee – Core Crawler.


Mowglee's core crawling system uses a hierarchy of crawlers for efficient crawling. MowgleeCrawl is a class that sequentially invokes all crawl types in Mowglee: static crawling, periphery crawling, and site crawling.
MowgleeStaticCrawl is the starting class for the crawling process. This will read the static geographical home page or crawl home. You may configure multiple crawl homes for each geography and start a MowgleeStaticCrawl process for each of them. You can visualize this as a very loose representation of a multi-agent system. There is a default safe waiting period of ten seconds that can be configured as per your needs. This is to make sure that all data is available from other running processes or other running threads before we begin the main crawl.
MowgleePeripheryCrawl is the Pass 1 crawler that deduces the top-level domains from a given page or hyperlink. It is built with the sole purpose of making MowgleeSiteCrawl (Pass 2) easier and more measurable. The periphery crawl can also remove duplicate top-level domains across crawls for a more concentrated effort in the next pass. In Pass 1, we concentrate only on the links, not the data.
MowgleeSiteCrawl is the Pass 2 crawler that instantiates an individual thread pool, using the JDK 6 executor service, for each MowgleeSite. The crawl at this stage is very extensive and intrusive in terms of the types of data it detects. In Pass 2, we classify link types as protocol, images, video, or audio, and we also gather metadata for the page. The most important aspect of this phase is that the analysis is dynamic yet controlled: the thread pool size grows with the number of pages on the site, as sketched below.
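As an illustration of this Pass 2 approach, here is a minimal sketch (not the actual Mowglee source; the sizing rule, bound, and worker body are assumptions) of a per-site thread pool created through the JDK executor service and grown with the page count, within an upper bound:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of Pass 2 pooling: one executor per site, sized by page count.
// The sizing rule, the bound, and the worker body are illustrative.
public class SiteCrawlSketch {

    private static final int MAX_POOL_SIZE = 16;

    public void crawlSite(String topLevelDomain, List<String> pages) {
        // grow the pool with the number of pages, within a safe upper bound
        int poolSize = Math.max(1, Math.min(pages.size(), MAX_POOL_SIZE));
        ExecutorService executor = Executors.newFixedThreadPool(poolSize);

        for (final String page : pages) {
            executor.submit(new Runnable() {
                public void run() {
                    // a real worker (e.g., MowgleeSiteWorker) would read
                    // and analyze the page here
                    System.out.println("crawling " + page);
                }
            });
        }
        executor.shutdown(); // accept no new tasks; let queued pages finish
    }
}

Bounding the pool size keeps a very large site from exhausting threads while still letting small sites finish quickly.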
Mowglee is organized as a set of workers within each of these crawling passes. For Pass 1, the actual work of reading and deduction is done by MowgleePeripheryWorker. For Pass 2, the work of reading and analyzing links is done by MowgleeSiteWorker. The helper classes it uses include the data analyzers explained later.
MowgleePeripheryWorker uses MowgleeDomainMap for storing the top-level domains. The most important lines of code are in MowgleeUrlStream; they open a stream to any given URL and read its contents, as shown below.


  public StringBuffer read(String httpUrl, String crawlMode) {

      StringBuffer httpUrlContents = new StringBuffer();
      InputStream inputStream = null;
      InputStreamReader inputStreamReader = null;

      MowgleeLogger mowgleeLogger = MowgleeLogger.getInstance("FILE");

      // check whether this is a static (file) crawl or a live http crawl
      try {

        if (crawlMode.equals(MowgleeConstants.MODE_STATIC_CRAWL)) {
           mowgleeLogger.log("trying to open the file from " + httpUrl, MowgleeUrlStream.class);
           inputStream = new FileInputStream(new File(httpUrl));
        } else {
           mowgleeLogger.log("trying to open a socket to " + httpUrl, MowgleeUrlStream.class);
           inputStream = new URL(httpUrl).openStream();
        }

        inputStreamReader = new InputStreamReader(inputStream);
        BufferedReader bufferedReader = new BufferedReader(inputStreamReader);

        // accumulate the page contents line by line into the return buffer
        String currentLine = bufferedReader.readLine();
        while (currentLine != null) {
           httpUrlContents.append(currentLine);
           currentLine = bufferedReader.readLine();
        }

        // defensive: && (not &) so trim() is never invoked on a null url
        if (httpUrl != null && httpUrl.trim().length() > 0) {
           MowgleePageReader mowgleePageReader = new MowgleePageReader();
           mowgleePageReader.read(httpUrl, httpUrlContents, crawlMode);

           mowgleeLogger.log("the size of read contents is " + httpUrlContents.toString().trim().length(),
                MowgleeUrlStream.class);
        }
      } catch (FileNotFoundException e) {

        mowgleeLogger.log("the url was not found on the server due to " + e.getLocalizedMessage(),
             MowgleeUrlStream.class);
      } catch (MalformedURLException e) {

        mowgleeLogger.log("the url was either malformed or does not exist", MowgleeUrlStream.class);
      } catch (IOException e) {

        mowgleeLogger.log("an error occurred while reading the url due to " + e.getLocalizedMessage(),
             MowgleeUrlStream.class);
      } finally {

        try {
           // always close the connection / file input stream (leak fix)
           if (inputStream != null)
               inputStream.close();
        } catch (IOException e) {

           mowgleeLogger.log("an error occurred while closing the connection " + e.getLocalizedMessage(),
                MowgleeUrlStream.class);
        }
      }
      return httpUrlContents;
   }
Listing 1: Mowglee — Opening Stream/Crawling.
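The listing manages the stream with an explicit finally block, which is what JDK 6 requires. If you are free to move past the JDK 6 recommendation above, try-with-resources (JDK 7+) achieves the same leak-free behavior more concisely. A minimal sketch, not part of the Mowglee codebase:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

// Sketch: reading a URL with try-with-resources (JDK 7+), so the reader
// and its underlying stream are closed even if an exception is thrown.
public class UrlReadSketch {

    public static StringBuffer read(String httpUrl) throws IOException {
        StringBuffer contents = new StringBuffer();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new URL(httpUrl).openStream()))) {
            String currentLine;
            while ((currentLine = reader.readLine()) != null) {
                contents.append(currentLine);
            }
        } // reader and stream are closed here in all cases
        return contents;
    }
}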



Figure 2: Mowglee – Analyzers.


In Mowglee, there are various analyzers for multiple types of data and media. The analyzer implemented as part of this codebase is MowgleeLinkAnalyzer. It uses MowgleeSiteMap as the in-memory store for all links within a given top-level domain. It also maintains a list of visited and collected URLs from all the crawled and analyzed hyperlinks within that top-level domain.
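The article does not reproduce MowgleeSiteMap's internals, but the visited/collected bookkeeping described above might look roughly like this sketch (the field and method names are assumptions):

import java.util.HashSet;
import java.util.Set;

// Sketch of per-domain link bookkeeping: URLs already crawled ("visited")
// versus URLs discovered but not yet crawled ("collected"). Names are
// illustrative; the actual MowgleeSiteMap may differ.
public class SiteMapSketch {

    private final Set<String> visitedUrls = new HashSet<String>();
    private final Set<String> collectedUrls = new HashSet<String>();

    // record a newly discovered hyperlink unless it was already crawled
    public void collect(String url) {
        if (!visitedUrls.contains(url)) {
            collectedUrls.add(url);
        }
    }

    // move a hyperlink from collected to visited once it has been crawled
    public void markVisited(String url) {
        collectedUrls.remove(url);
        visitedUrls.add(url);
    }

    public boolean hasPendingLinks() {
        return !collectedUrls.isEmpty();
    }
}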
 
 

Figure 3: Mowglee – Filter, Logger, and Utilities.


MowgleeGarbageCollector is a daemon thread that is instantiated and started when the main application runs. As a large number of objects are instantiated per thread, this thread tries to control and enforce the internal garbage collection mechanism within safe limits of memory usage. MowgleeLogger provides the abstract class for all types of loggers in Mowglee. There is also an implementation of the robots exclusion protocol provided in MowgleeRobotsExclusionFilter, which inherits from MowgleeCrawlFilter. All other types of filters that are closer to the functioning of a crawler system may extend this class.
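As a rough illustration of the daemon-thread idea (the threshold and interval here are assumptions, not the actual MowgleeGarbageCollector values), such a thread might be structured as follows:

// Sketch of a daemon thread that suggests garbage collection when free
// memory drops below a threshold. Threshold and interval are illustrative.
public class GcDaemonSketch {

    public static void start() {
        Thread gcThread = new Thread(new Runnable() {
            public void run() {
                Runtime runtime = Runtime.getRuntime();
                while (true) {
                    // suggest a collection when under ~20% of the heap is free
                    if (runtime.freeMemory() < runtime.totalMemory() / 5) {
                        System.gc(); // a hint only; the JVM may ignore it
                    }
                    try {
                        Thread.sleep(10000); // re-check every ten seconds
                    } catch (InterruptedException e) {
                        return; // stop quietly when interrupted
                    }
                }
            }
        });
        gcThread.setDaemon(true); // a daemon thread never blocks JVM exit
        gcThread.start();
    }
}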
 
MowgleeCommonUtils provides a number of common helper functions, such as deduceTopLevelDomain(). MowgleeSitemapGenerator is the placeholder class for generating the sitemap as per the sitemaps protocol, and a starting point for a more extensive or custom implementation. Implementations for analyzing images, video, and audio can be added; only the placeholders are provided.
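The article does not list deduceTopLevelDomain() itself; a simplified stand-in might reduce a hyperlink to its host like this (a sketch, not the MowgleeCommonUtils implementation):

import java.net.MalformedURLException;
import java.net.URL;

// Sketch of a helper that reduces a full hyperlink to its domain (here
// approximated by the host, with a leading "www." stripped). The real
// deduceTopLevelDomain() in MowgleeCommonUtils may behave differently.
public class DomainSketch {

    public static String deduceTopLevelDomain(String hyperlink) {
        try {
            String host = new URL(hyperlink).getHost();
            return host.startsWith("www.") ? host.substring(4) : host;
        } catch (MalformedURLException e) {
            return null; // the caller decides how to treat unparseable links
        }
    }

    public static void main(String[] args) {
        // prints "timesofindia.com"
        System.out.println(deduceTopLevelDomain("http://www.timesofindia.com/news"));
    }
}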


Applications

The following are ideal applications of this category of web crawler.

Website Governance and Government Enforcements

This system can be used to enforce localized or customized geographical rules. Any classification done through manual deduction or automatic detection of data patterns for administrative purposes can also be driven easily from the generated data.

Ranking Sites and Links

The ranking of sites and links for a search engine that is localized to a particular geography, or to specific areas, can be performed on the data from this system. It is easier to find relations and click patterns among links within the same geography.

Analytics and Data Patterns

The data collected from this crawler can be analyzed using third-party or custom tools for further knowledge creation. Relevant digital repositories can also be created from the generated volumes of data.

Advertising on the Basis of Keywords

An important application is driving advertising based on the data collected by this crawling system. An analyzer can find specific terms, keywords, and phrases, and then embed relevant advertising or generate advice for advertisers.


Enhancements

The following are some enhancements for this category of web crawler.

Store Using Graph Database

You may add your own implementation of graph storage or use one of the popular NoSQL graph databases. The starting point is the MowgleePersistenceCore class.
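Before wiring in an actual graph database, crawled links can be modeled as a simple in-memory adjacency structure. A sketch (MowgleePersistenceCore's real API is not shown in this article, so these method names are assumptions):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of an in-memory adjacency map ("page links to page") that could
// later be swapped for a NoSQL graph store. Method names are illustrative.
public class LinkGraphSketch {

    private final Map<String, Set<String>> edges = new HashMap<String, Set<String>>();

    // record a directed edge: fromUrl contains a hyperlink to toUrl
    public void addLink(String fromUrl, String toUrl) {
        if (!edges.containsKey(fromUrl)) {
            edges.put(fromUrl, new HashSet<String>());
        }
        edges.get(fromUrl).add(toUrl);
    }

    public Set<String> outgoingLinks(String url) {
        Set<String> links = edges.get(url);
        return links == null ? new HashSet<String>() : links;
    }
}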

Varied Analyzers

You can add more analyzers for your own organizational or academic needs. The base class to extend is MowgleeAnalyzer; refer to MowgleeLinkAnalyzer to understand an analyzer's implementation, and see the sketch below.
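For example, a keyword-counting analyzer might extend the base class roughly as follows; the abstract method shown here is an assumption made for illustration, so check MowgleeLinkAnalyzer for the actual contract:

// Sketch of a custom analyzer. The abstract method below is an assumed
// stand-in for the real MowgleeAnalyzer contract.
abstract class AnalyzerBaseSketch {
    abstract void analyze(String url, String pageContents);
}

class KeywordAnalyzerSketch extends AnalyzerBaseSketch {

    private final String keyword;

    KeywordAnalyzerSketch(String keyword) {
        this.keyword = keyword;
    }

    @Override
    void analyze(String url, String pageContents) {
        // count plain-text occurrences of the keyword on the page
        int count = 0;
        int index = pageContents.indexOf(keyword);
        while (index >= 0) {
            count++;
            index = pageContents.indexOf(keyword, index + keyword.length());
        }
        System.out.println(url + " mentions '" + keyword + "' " + count + " time(s)");
    }
}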

Focused Classifiers

You can add a hierarchy of classifiers or plugins to automatically classify the data based on terms, keywords, geographic lingo, or phrases. An example is MowgleeReligiousAbusePlugin.

Complex Deduction Mechanisms

More complex deduction mechanisms may be added to make crawling even more relevant within a specific geography. This includes ignoring links outside a geographic region, country, or even continent for a specific crawl session. At a higher level, such as the continent level, this may be implemented as a coordinated multi-agent system.

Sitemap Generation

A starting point for the sitemap generation mechanism is provided. You may use other sitemap generation libraries internally or develop your own custom implementation to enhance this functionality.
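For orientation, emitting a minimal sitemap per the sitemaps protocol can be as small as the following sketch (XML escaping and optional elements such as <lastmod> and <priority> are omitted):

import java.util.List;

// Sketch: build a minimal sitemap XML document per the sitemaps protocol.
// Escaping and optional elements (<lastmod>, <priority>) are omitted.
public class SitemapSketch {

    public static String generate(List<String> urls) {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        xml.append("<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
        for (String url : urls) {
            xml.append("  <url><loc>").append(url).append("</loc></url>\n");
        }
        xml.append("</urlset>\n");
        return xml.toString();
    }
}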


Conclusion

Mowglee does not guarantee that a sitemap will be created or used for crawling. It uses an asynchronous mechanism and makes sure that links that may not be reachable from within a site itself are also crawled. Mowglee currently has no termination mechanism; I have provided a placeholder for you to decide as per your usage. Termination could be based on the number of top-level domains crawled, data volume, number of pages crawled, geography boundaries, types of sites to crawl, or an arbitrary mechanism, as sketched below.
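For instance, a termination check plugged into that placeholder might look like the following sketch; the counters and limits are assumptions for illustration:

// Sketch of a crawl termination predicate based on the criteria listed
// above. The counters and limits are illustrative assumptions.
public class TerminationSketch {

    private int pagesCrawled;
    private int domainsCrawled;

    private static final int MAX_PAGES = 100000;
    private static final int MAX_DOMAINS = 500;

    public void onPageCrawled()   { pagesCrawled++; }
    public void onDomainCrawled() { domainsCrawled++; }

    // the crawl loop would poll this and stop once any limit is reached
    public boolean shouldTerminate() {
        return pagesCrawled >= MAX_PAGES || domainsCrawled >= MAX_DOMAINS;
    }
}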
Please shut down other applications on your system to dedicate the available resources to Mowglee. The default starting crawl point is http://www.timesofindia.com, but you may try changing the starting crawl point to other sites as well. In the current form, you can use the mowglee.crawl file to read the crawl analysis; no other storage mechanism is provided. You may keep this file open in an editor like EditPlus to continuously monitor its contents.
This article should give you a solid grounding in building a layered multi-threaded crawler, especially for applications that need geography-based classification and affinity. Reusing this codebase should also save you time in building your own crawlers or applications. You may also build the enhancements mentioned above on top of Mowglee and post them back on DZone for the benefit of the entire community!

[Trivia Facts About the Piece of Code / Software]
Mowglee, Efficient Geo Web Crawler [Core Java]
Mowglee v0.02a (2021) - http://bit.do/mowglee
-
0. Published on My Own Online Technical Blog and Site – Techila Shots (.in)
1. Published in Online (Java) Developer Magazine – DZone
2. Article in Online/Print Java Magazine – JavaMag (.org)
3. Published in [CS] Mag – Computer Society of India Communications
4. Final Project for Part-Time AI Course at Indian Institute of Science (IISc)
 
 

Friday, February 12, 2021

SKP's Java Problem Solving Series : Usernames Changes (HackerRank)

[Question/Problem Statement is Adapted from HackerRank]

Algorithms/Data Structures - [Problem Solving] 
There is a specific need for changes in a list of usernames: for each username in the given list, determine whether the username can be modified so that it moves ahead (that is, earlier) in dictionary order. The allowed modification is that characters can change positions within the given username.

Example
usernames[] = {"Aab", "Cat"}
 
"Aab" cannot be changed to another unique string matching the above rule - Hence, It can Never Find a Place Ahead in the Dictionary. Hence, Output will be "NO". "Cat" can be Changed to "Act", "Atc", "Tca", "Tac", "Cta" and Definitely "Act" will Find a Place Before "Cat" in the Dictionary. Hence, Output will be "YES".

[Function Description]
Complete the function possibleChanges in the Editor Below.
 
possibleChanges has the Following Parameters:
String usernames[n]: An Array of User Names
 
Returns String[n]: An Array with "YES" or "NO" Based on Feasibility
(Actual Question Says String Array, But Signature is List of Strings)


Constraints
• [No Special Constraints Exist, But Cannot Recall Exactly]


Input Format

"The First Line Contains an Integer, n, the Number of Elements in Usernames.", 
"Each Line of the n Subsequent Lines (where 0 < i < n) contains a String usernames[i]."        

[Sample Case 0 - Sample Input For Custom Testing]        
8      
Aab 
Cat
Pqrs
Buba
Bapg
Sungi
Lapg
Acba
       

Sample Output
NO
YES
NO
YES
YES
YES
YES
YES
  
______________ 
 
 
[Explanation of the Solution]
This is again a good question from HackerRank to test your logic and problem-solving ability. The core point: for each combination of two characters in the username string (after lowercasing), check whether a later-occurring character is less than an earlier-occurring character by ASCII value. For example, in the string "Bapg", for the selection "Ba", the character "a" occurs after "B" in the string yet comes earlier in the alphabet, so a smaller arrangement exists. Two loops (one nested) decide this for each pair of characters. The time complexity is O(L^2) per username, where L is the length of the username.
 
________________  
 

[Source Code, Sumith Puri (c) 2021 - Free to Use & Distribute]
import static java.util.stream.Collectors.joining;
import static java.util.stream.Collectors.toList;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.IntStream;

/*
* HackerRank Problem Solving - Speak Ur Mind, But Ride a Fast Horse.
* ~ Sumith Kumar Puri (c) 2021 ~ -- ~ Bengaluru, Karnataka, India ~
*
*/
class UsernamesChangesLogic {

    public static List<String> possibleChanges(List<String> usernames) {

        List<String> solutionStr = new ArrayList<String>();
        boolean bobbysFlag = false;

        for (String username : usernames) {

            bobbysFlag = false;
            String currName = username.toLowerCase();

            for (int i = 0; i < currName.length(); i++) {

                int a = currName.charAt(i);

                // look for any later character smaller than an earlier one;
                // if one exists, a lexicographically smaller arrangement exists
                for (int j = i + 1; j < currName.length(); j++) {

                    int b = currName.charAt(j);
                    if (b < a) {
                        bobbysFlag = true;
                        break;
                    }
                }

                if (bobbysFlag) {
                    solutionStr.add("YES");
                    break;
                }
            }

            if (!bobbysFlag)
                solutionStr.add("NO");
        }

        return solutionStr;
    }
}

public class UsernamesChanges {

    // absolute path of the output file (placeholder to be replaced)
    public static final String OUTPUT_PATH = "PROVIDE_ABSOLUTE_OUTPUT_FILE_NAME";

    public static void main(String[] args) throws IOException {
        BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(System.in));
        BufferedWriter bufferedWriter = new BufferedWriter(new FileWriter(OUTPUT_PATH));

        int usernamesCount = Integer.parseInt(bufferedReader.readLine().trim());

        List<String> usernames = IntStream.range(0, usernamesCount).mapToObj(i -> {
            try {
                return bufferedReader.readLine();
            } catch (IOException ex) {
                throw new RuntimeException(ex);
            }
        }).collect(toList());

        List<String> result = UsernamesChangesLogic.possibleChanges(usernames);

        bufferedWriter.write(result.stream().collect(joining("\n")) + "\n");

        bufferedReader.close();
        bufferedWriter.close();
    }
}

Happy Problem Solving using Java!

SKP's Java Problem Solving Series : Active Traders (HackerRank)

[Question/Problem Statement is the Property of HackerRank]

Algorithms/Data Structures - [Problem Solving] 
An institutional broker wants to review their book of customers to see which are most active. Given a list of trades by customer name, determine which customers account for at least 5% of the total number of trades. Order the list alphabetically, ascending by name.


Example
n = 23
"customers = {"Bigcorp", "Bigcorp", "Acme", "Bigcorp", "Zork", "Zork", "Abe", "Bigcorp",  "Acme", "Bigcorp", "Bigcorp" , "Zork", "Bigcorp", "Zork", "Zork", "Bigcorp", "Acme", "Bigcorp", "Acme", "Bigcorp", "Acme",""Littlecorp" , "Nadircorp "}."


"Bigcorp had 10 Trades out of 23,which is 43.48% of the Total Trades."

"Both Acme and Zork had 5 trades,which is 21.74% of the Total Trades."

"The Littlecorp, Nadircorp and Abe had 1 Trade Each, which is 4.35%..."

"So the Answer is [""Acme"", "" Bigcorp  ,""Zork""] (In Alphabetical Order) Because only These Three Companies Placed atleast 5% of the Trades.


Function Description

Complete the function mostActive in the editor below.

mostActive has the following parameter:
String customers[n]: an array of customer names

(The actual question says string array, but the signature is a list of strings.)

Returns String[]: an alphabetically ascending array of customer names


Constraints

• 1 ≤ n ≤ 10^5

• 1 ≤ length of customers[i] ≤ 20

• The first character of customers[i] is a capital English letter.

• All characters of customers[i] except the first are lowercase.

• It is guaranteed that at least one customer makes at least 5% of the trades.



Input Format
            

"The First Line contains an integer, n, The Number of Elements in customers."       

"Each Line iof the n Subsequent Lines (where 0 s i< n) contains a string, customers[i]."      


Sample Case 0 Input For Custom Testing
20       

Omega
Alpha
Omega
Alpha
Omega
Alpha
Omega
Alpha
Omega
Alpha
Omega
Alpha
Omega
Alpha
Omega
Alpha
Omega
Alpha
Omega
Beta


Function mostActive      
customers[] size n =  20       

customers[] = [As Provided Above]       



Sample Output

Alpha       

Beta

Omega       



Explanation

"Alpha made 10 Trades out of 20 (50% of the Total), Omega made 9 Trades (45% of the Total). and Beta made 1 Trade (5% of the Total).All of them have met the 5% Threshold, so all the Strings are Returned in an Alphabetically Ordered Array."        

 
______________ 
 
 
[Explanation of the Solution]
This is good practice for problem solving; it involves simple arithmetic and a mathematical application. Ideally, a programmer would want to optimize the solution in space and time (which I did not :-)).
 
________________  
 

[Source Code, Sumith Puri (c) 2021 - Free to Use & Distribute]
import static java.util.stream.Collectors.joining;
import static java.util.stream.Collectors.toList;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.stream.IntStream;

/*
* HackerRank Problem Solving - Ain't a Horse that Can't be Rode
* Sumith Kumar Puri (c) 2021 - ~ Bengaluru, Karnataka, India ~
*
*/
class ActiveTradersLogic {

    public static List<String> mostActive(List<String> customers) {

        // a TreeMap keeps customer names in ascending alphabetical order
        Map<String, Integer> customerMap = new TreeMap<String, Integer>();
        List<String> solutionStr = new ArrayList<String>();
        int totalTrades = customers.size();

        // count the number of trades per customer
        for (int i = 0; i < totalTrades; i++) {

            String customerKey = customers.get(i);

            if (customerMap.containsKey(customerKey)) {
                Integer customerCount = customerMap.get(customerKey);
                customerMap.put(customerKey, ++customerCount);
            } else {
                customerMap.put(customerKey, 1);
            }
        }

        // keep every customer whose share is at least 5% of all trades
        Set<String> customerMapKeys = customerMap.keySet();
        for (String customerKey : customerMapKeys) {

            Integer customerCount = customerMap.get(customerKey);
            double currentCustomerPercent = (double) customerCount / (double) totalTrades;

            if (currentCustomerPercent * 100 >= 5.0) {
                solutionStr.add(customerKey);
            }
        }

        return solutionStr;
    }
}

public class ActiveTraders {

    // absolute path of the output file (placeholder to be replaced)
    public static final String OUTPUT_PATH = "PROVIDE_ABSOLUTE_OUTPUT_FILE_NAME";

    public static void main(String[] args) throws IOException {
        BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(System.in));
        BufferedWriter bufferedWriter = new BufferedWriter(new FileWriter(OUTPUT_PATH));

        int customersCount = Integer.parseInt(bufferedReader.readLine().trim());

        List<String> customers = IntStream.range(0, customersCount).mapToObj(i -> {
            try {
                return bufferedReader.readLine();
            } catch (IOException ex) {
                throw new RuntimeException(ex);
            }
        }).collect(toList());

        List<String> result = ActiveTradersLogic.mostActive(customers);

        bufferedWriter.write(result.stream().collect(joining("\n")) + "\n");

        bufferedReader.close();
        bufferedWriter.close();
    }
}

Happy Problem Solving using Java!

Tuesday, February 2, 2021

SKP's Java Problem Solving Series : Refresh Java Lambdas (FP)

[Question/Problem Statement is the Property of Techgig]

Java Advanced - Lambda Expressions [www.techgig.com] 
Write the following methods, each returning a lambda expression that performs the specified action:
PerformOperation isOdd(): The lambda expression must return whether a number is odd or even.
PerformOperation isPrime(): The lambda expression must return whether a number is prime or composite.
PerformOperation isPalindrome(): The lambda expression must return whether a number is a palindrome or not.

Input Format
Input is as shown in the format below (deduce the unknowns!)

Input
3
1 3
2 7
3 7777

Constraints
NA

Output Format
Output is as shown in the format below (deduce the unknowns!)

Output
ODD
PRIME
PALINDROME
______________ 
 
 
[Explanation of the Solution]
This is a good question for refreshing Java 8 lambdas. In my solution, I implemented the functional interface within my main() method and assigned the lambdas to local reference variables.
 
________________  
 

[Source Code, Sumith Puri (c) 2021 - Free to Use & Distribute]
import java.util.Scanner;

/*    
 * Techgig Core Java Basics Problem - Knock Off Java Lambdas!   
 * Author: Sumith Puri [I Bleed Java!] // GitHub: @sumithpuri   
 */
@FunctionalInterface
interface LambdaYogi {
	public boolean opYog(int x);
}

public class CandidateCode {
	public static void main(String args[]) throws Exception {

                // you may choose to refactor, as this method
                // becomes really long and unmanageable #TODO
		LambdaYogi isOdd = a -> {
			boolean retFlag = false;
			if (a % 2 != 0)
				retFlag = true;
			return retFlag;
		};
		LambdaYogi isPrime = a -> {
			// guard: numbers below 2 are not prime
			if (a < 2)
				return false;
			boolean retFlag = true;
			for (int i = 2; i < a - 1; i++) {
				if (a % i == 0) {
					retFlag = false;
					break;
				}
			}
			return retFlag;
		};
		LambdaYogi isPalindrome = a -> {
			boolean retFlag = false;
			String actStr = String.valueOf(a).trim();
			String revStr = new StringBuffer(actStr).reverse().toString();

			// using string basis, not mathematical
			if (actStr.equals(revStr))
				retFlag = true;
			return retFlag;
		};

		Scanner scanner = new Scanner(System.in);
		int val = scanner.nextInt();

		for (int i = 0; i < val; i++) {
			int op = scanner.nextInt();
			int no = scanner.nextInt();
			switch (op) {
			case 1: {
				if (isOdd.opYog(no))
					System.out.println("ODD");
				else
					System.out.println("EVEN");
				break;
			}
			case 2: {
				if (isPrime.opYog(no))
					System.out.println("PRIME");
				else
					System.out.println("COMPOSITE");
				break;
			}
			case 3: {
				if (isPalindrome.opYog(no))
					System.out.println("PALINDROME");
				else
					System.out.println("NONPALIN");
			}
			}
		}
	}
}

Happy Problem Solving using Java!