Using JSOUP to fetch data from a web page.

Last week I was working on a project in Java. I had to create a function that could read data from a web page and make decisions accordingly. Almost everything was working, except that my Java program was not able to read data from the web page. I looked around on the web and found a good library: jsoup.

According to their website, jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods.
jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do.
  • scrape and parse HTML from a URL, file, or string
  • find and extract data, using DOM traversal or CSS selectors
  • manipulate the HTML elements, attributes, and text
  • clean user-submitted content against a safe white-list, to prevent XSS attacks
  • output tidy HTML
jsoup is designed to deal with all varieties of HTML found in the wild; from pristine and validating, to invalid tag-soup; jsoup will create a sensible parse tree.


I downloaded the library, but I was not able to find a good example that could get me going in no time. So here's how I used it! :)



Explanation:

JSOUP is very simple to use. Make sure its .jar is on your classpath and that you have imported its classes into your program.




import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// The class name is just a placeholder; put the method wherever you need it.
public class Scraper {

    public static String getDetails(String link) {
        String finalS = "";

        try {
            // Connect to the link and parse the page into a Document.
            // You can use Document only after you have imported jsoup.
            Document doc = Jsoup.connect(link).get();

            // Select the <div> you want to scrape.
            Elements text = doc.select("div[class=torrentMediaInfo]");

            // Remove the elements you don't need. For example, I removed
            // all anchors, including links (<a href>).
            text.select("a").remove();

            // Convert what is left into a string of HTML.
            finalS = text.toString();
        } catch (IOException e) {
            // The page could not be fetched; an empty string is returned.
            e.printStackTrace();
        }
        return finalS; // Return the string variable back!
    }
}

//That's it! :D
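
If you want to try the select-and-remove steps without hitting a real site, here is a small sketch that runs the same idea on an HTML string using Jsoup.parse. The class name OfflineScrapeDemo, the method extractWithoutLinks, and the sample HTML are made up for illustration, and it returns plain text (via text()) instead of HTML, which is my own variation and not what getDetails does.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class OfflineScrapeDemo {

    // Hypothetical helper: same steps as getDetails, but on an HTML string.
    public static String extractWithoutLinks(String html) {
        // Parse the string into a Document; no network access is needed.
        Document doc = Jsoup.parse(html);

        // Select the <div> we care about (same selector as in the post).
        Elements block = doc.select("div[class=torrentMediaInfo]");

        // Drop every anchor inside the selected block.
        block.select("a").remove();

        // text() returns the visible text only, with whitespace normalized.
        return block.text();
    }

    public static void main(String[] args) {
        // Sample markup invented for this demo.
        String html = "<div class=\"torrentMediaInfo\">"
                + "<p>Size: 700 MB <a href=\"/more\">more</a></p>"
                + "</div>";
        System.out.println(extractWithoutLinks(html));
    }
}
```

Swap text() for toString() if you want the remaining HTML tags as well, like the function above returns.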



You can select other sections of a web page too! For example:

        text = doc.select("div[style=\"margin-left:10px;margin-right:10px;\"]");

will select every <div> whose style attribute is exactly "margin-left:10px;margin-right:10px;", that is, <div style="margin-left:10px;margin-right:10px;">.

If you want to scrape content from other sections, the selector you pass to doc.select() is the only thing you have to tweak.
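
Since the selector string is the only part that changes, it can help to build it from its pieces so you don't have to retype the escaped quotes every time. This is just a sketch: SelectorBuilder and attributeSelector are hypothetical names of mine, not part of jsoup, and it assumes the attribute value contains no double quotes.

```java
public class SelectorBuilder {

    // Hypothetical helper (not a jsoup API): builds tag[attribute="value"].
    // Assumes the value itself contains no double quotes.
    public static String attributeSelector(String tag, String attribute, String value) {
        return tag + "[" + attribute + "=\"" + value + "\"]";
    }

    public static void main(String[] args) {
        // The two selectors used in this post, built from their parts.
        System.out.println(attributeSelector("div", "class", "torrentMediaInfo"));
        System.out.println(attributeSelector("div", "style",
                "margin-left:10px;margin-right:10px;"));
    }
}
```

You would then pass the result straight to doc.select(...).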

Keep trying until you get the right content.
Lots of love! :)  




Hi! I'm Siddhant Minocha, a Jaipur based hacker/developer. I spend most of my day coding things, trying to develop the next billion dollar idea. When I'm not over-analyzing random stuff and people, I hang out with friends (mostly entrepreneurs). I'm in my final year of graduation and till now I have started and exited 3 startups and developed a lot of websites and software for different companies. Follow me on twitter and other social networks to see what I keep doing. You can also hire me to design/develop your website, if you're worried about your website or app's security or any other tech related work. See ya! :D
