Last week I was working on a project in Java. I had to write a function that could read data from a web page and make decisions accordingly. Almost everything was working, except that my Java program could not read data from the web page. I looked around on the web and found a good library: jsoup.
According to their website, jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods.
jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do.
- scrape and parse HTML from a URL, file, or string
- find and extract data, using DOM traversal or CSS selectors
- manipulate the HTML elements, attributes, and text
- clean user-submitted content against a safe white-list, to prevent XSS attacks
- output tidy HTML
I downloaded the library, but I could not find a good example that would get me going in no time. So here's how I used it! :)
Explanation:
jsoup is very simple to use. Make sure its .jar is on your classpath and that you have imported its classes into your program.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public static String getDetails(String link) {
    String finalS = "";
    try {
        // Connect to the link and parse the page into a Document.
        // Document and Elements are only available once jsoup is imported.
        Document doc = Jsoup.connect(link).get();
        // Select the <div> you want to scrape.
        Elements text = doc.select("div[class=torrentMediaInfo]");
        // Remove elements you don't want. For example, I removed all
        // anchors, including links (<a href>).
        text.select("a").remove();
        // Convert what is left into a string (this is HTML, tags included).
        finalS = text.toString();
    } catch (IOException e) {
        // The page could not be fetched; finalS stays empty.
        e.printStackTrace();
    }
    return finalS; // Return the string variable back!
}
// That's it! :D
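If you want to try the same steps without hitting a real site, here is a minimal sketch that runs them on an in-memory string. The class name ScrapeSketch, the method getPlainText and the sample markup are all made up for illustration. It also returns text() instead of toString(), which should give you the visible text without the tags.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class ScrapeSketch {
    // Hypothetical sample markup standing in for a downloaded page.
    static final String SAMPLE_HTML =
        "<html><body>"
        + "<div class=\"torrentMediaInfo\">Size: 700 MB <a href=\"/more\">more</a></div>"
        + "</body></html>";

    // Same select-then-remove steps as getDetails, but on a string
    // that is already in memory, so no network access is needed.
    public static String getPlainText(String html) {
        Document doc = Jsoup.parse(html);
        Elements text = doc.select("div[class=torrentMediaInfo]");
        text.select("a").remove(); // drop the anchors, as before
        return text.text();        // visible text only, no tags
    }

    public static void main(String[] args) {
        System.out.println(getPlainText(SAMPLE_HTML));
    }
}
```

Use toString() when you want to keep the markup, and text() when you only care about what a reader would see on the page.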
You can select other sections of a web page too! For example,
text = doc.select("div[style=\"margin-left:10px;margin-right:10px;\"]");
will select every <div> whose style attribute is exactly "margin-left:10px;margin-right:10px;", that is, <div style="margin-left:10px;margin-right:10px;">.
If you want to scrape content under other sections, all you have to tweak is the selector string in this line: text = doc.select("div[style=\"margin-left:10px;margin-right:10px;\"]");
Keep adjusting the selector until you get the right content.
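Here is a rough sketch of a few other selector styles you could try in that line. The markup, the class name SelectorSketch and the helper pick are hypothetical. The selectors themselves (by class, by id, and "attribute contains") are standard jsoup CSS-style queries.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorSketch {
    // Hypothetical markup with three differently marked-up <div>s.
    static final String SAMPLE_HTML =
        "<div class=\"info\">first</div>"
        + "<div id=\"details\">second</div>"
        + "<div style=\"margin-left:10px;margin-right:10px;\">third</div>";

    // Parse the sample and return the text of whatever the query matches.
    public static String pick(String cssQuery) {
        Document doc = Jsoup.parse(SAMPLE_HTML);
        return doc.select(cssQuery).text();
    }

    public static void main(String[] args) {
        System.out.println(pick("div.info"));                // match by class
        System.out.println(pick("div#details"));             // match by id
        System.out.println(pick("div[style*=margin-left]")); // style contains "margin-left"
    }
}
```

The "contains" form is handy when the style attribute on the real page has extra rules you don't want to spell out exactly.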
Lots of love! :)