Crawling is a common task in software development: news aggregation, discount tracking, movie tickets, and so on. Put simply, crawling means fetching a page's HTML, parsing its tags, and extracting the data you need. The Go library I usually use for this is goquery.
However, crawling the raw HTML does not work in some cases: when data is loaded by Ajax (the initial HTML contains only a wrapper, not the data), or when the page requires a login before it can be crawled.
For these cases, I use Selenium to load the page in a real browser and interact with it, so the HTML is fully rendered before extracting data.
First, go to the SeleniumHQ site to download and set up Selenium. Selenium acts as a server, receiving requests sent from our Go code.
To run it, go to the folder containing the jar file and run:
java -jar selenium-server-standalone-2.50.1.jar -port 8081
=> We now have a Selenium server running on port 8081. Next, pull in go-selenium with go get:
go get sourcegraph.com/sourcegraph/go-selenium
After that, we need a browser; I chose Firefox. When running locally, you only need Firefox installed on your machine. When running on a remote host, you need to install Firefox with a shell script; you can refer to guides on setting up Selenium on Ubuntu 14.04. Done! Now let's code.
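A minimal sketch of the crawler using the go-selenium API could look like this. The Amazon Gold Box URL and the bare `img` selector are assumptions for illustration; point them at the page and element you actually need:

```go
package main

import (
	"fmt"
	"log"

	selenium "sourcegraph.com/sourcegraph/go-selenium"
)

func main() {
	// Ask the Selenium server (running on port 8081) for a Firefox session.
	caps := selenium.Capabilities{"browserName": "firefox"}
	webDriver, err := selenium.NewRemote(caps, "http://localhost:8081/wd/hub")
	if err != nil {
		log.Fatalf("failed to open session: %v", err)
	}
	defer webDriver.Quit()

	// Load the page in a real browser, so Ajax-rendered content is available.
	if err := webDriver.Get("https://www.amazon.com/gp/goldbox"); err != nil {
		log.Fatalf("failed to load page: %v", err)
	}

	title, err := webDriver.Title()
	if err != nil {
		log.Fatalf("failed to get page title: %v", err)
	}
	fmt.Println("Page title:", title)

	// The CSS selector here is an assumption; narrow it to the element you want.
	elem, err := webDriver.FindElement(selenium.ByCSSSelector, "img")
	if err != nil {
		log.Fatalf("failed to find element: %v", err)
	}
	src, err := elem.GetAttribute("src")
	if err != nil {
		log.Fatalf("failed to read attribute: %v", err)
	}
	fmt.Println(src)
}
```

Because the page is rendered by Firefox before we read it, data that only appears after Ajax calls is available to `FindElement`. Note this sketch needs the Selenium server from the previous step running, so it cannot run standalone.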
Running the code, we get:
Page title: Gold Box Deals | Today's Deals - Amazon.com
https://images-na.ssl-images-amazon.com/images/I/51eU5JrGAXL._AA210_.jpg
With that, we have all the information we need.
The above is what I have learned from dealing with crawling problems in software development, here with the Go programming language. Selenium also helps in other cases, such as pages that require a login or pages that show a captcha. If you have other experiences, I would love to hear from you.
Drop us a message if you need any help from the Dwarves.