Developer Tips
Read the Puppeteer documentation
It's actually quite informative to read the Puppeteer documentation at https://pptr.dev/.
I recommend reading the intro section, and then the Page API page, as that is the API you will be using most frequently.
Use log levels appropriately
While a quick console.log() can be useful during development, please remove them before committing to master. Instead, you should use the built-in logging system to output information to the console according to the following guidelines:
this.log.error(). When you want to indicate that a non-recoverable error has occurred, call this.log.error(). That said, it's probably more convenient to simply throw an Error for code in login(), launch(), generateListings(), or processListings(), because those errors will be caught by the superclass and this.log.error() will be invoked automatically.
this.log.warn(). When something exceptional happens that you think should be highlighted, use this.log.warn(). This logging level is also used to indicate that a scraper has started, so running a scraper with logging set to warn typically produces only one line of output under normal conditions.
this.log.info(). This is the default logging level. The goal of "info" logging is to give the user enough feedback to know that the scraper is making progress, without overwhelming them with output. Try to be judicious. For example, if your scraper normally goes through 90 pages, emit an info message every 5 or 10 pages so that there aren't 90 lines of output.
this.log.debug(). You can emit as much output as you want at the debug level. Note that the Scraper superclass sets up an event handler so that all output to the Puppeteer console is logged at the debug level. So, when you run a scraper with logging set to debug, you might get lines of output similar to this:
09:38:07 DEBUG LINKEDIN PUPPETEER CONSOLE: visitor.publishDestinations() result: The destination publishing iframe is already attached and loaded.
09:38:07 DEBUG LINKEDIN PUPPETEER CONSOLE: Failed to load resource: the server responded with a status of 400 ()
It can be very informative to see what is being printed to the Puppeteer console, although some of this output might not have been generated by your scraper!
this.log.trace(). Calling this emits a stack trace at the moment of invocation.
So, for example, here is the default output for the NSF scraper:
$ npm run scrape -- -s nsf
> scraper@2.0.0 scrape
> ts-node -P tsconfig.buildScripts.json scrape.ts "-s" "nsf"
12:21:06 WARN NSF Launching NSF scraper
12:21:10 INFO NSF Wrote 100 listings.
12:21:10 INFO NSF Wrote statistics.
Since this scraper runs quickly, there's no need to augment the built-in logging messages.
On the other hand, the default (info) output for the Simply Hired scraper might best look like this:
$ npm run scrape -- -s simplyhired
> scraper@2.0.0 scrape
> ts-node -P tsconfig.buildScripts.json scrape.ts "-s" "simplyhired"
12:24:03 WARN SIMPLYHIRED Launching SIMPLYHIRED scraper
12:24:23 INFO SIMPLYHIRED Processed page 1, 19 internships
12:26:07 INFO SIMPLYHIRED Processed page 10, 152 internships
12:27:49 INFO SIMPLYHIRED Processed page 20, 279 internships
12:29:23 INFO SIMPLYHIRED Processed page 30, 399 internships
12:30:59 INFO SIMPLYHIRED Processed page 40, 519 internships
12:32:38 INFO SIMPLYHIRED Processed page 50, 646 internships
12:34:37 INFO SIMPLYHIRED Processed page 60, 798 internships
12:36:59 INFO SIMPLYHIRED Processed page 70, 983 internships
12:39:15 INFO SIMPLYHIRED Processed page 80, 1163 internships
12:41:28 INFO SIMPLYHIRED Processed page 90, 1343 internships
12:41:54 INFO SIMPLYHIRED Reached the end of pages!
12:41:54 INFO SIMPLYHIRED Wrote 1377 listings.
12:41:54 INFO SIMPLYHIRED Wrote statistics.
In this case, there's roughly a 90 second delay between each line of output. You can write code like the following to elide output in info mode but print everything in debug mode:
const message = `Processed page ${totalPages}, ${internshipsPerPage} internships`;
((totalPages === 1) || (totalPages % 10 === 0)) ? this.log.info(message) : this.log.debug(message);
Avoid page.evaluate()
The FAQ section of https://pptr.dev/ has a question entitled "What's the difference between a 'trusted' and 'untrusted' input event?". It turns out that to keep sites from blocking us as robots, we should always generate "trusted" events. This means we should avoid using page.evaluate(), which generates untrusted events. Here's a quote from the docs:
For automation purposes it's important to generate trusted events. All input events generated with Puppeteer are trusted and fire proper accompanying events. If, for some reason, one needs an untrusted event, it's always possible to hop into a page context with page.evaluate and generate a fake event:
await page.evaluate(() => {
document.querySelector('button[type=submit]').click();
});
We definitely want to avoid "fake events", because certain sites might use them to bar us from scraping them. Note that it's OK to use page.evaluate() if you aren't generating events (i.e. you are just inspecting the page contents). You should avoid things like .click() inside page.evaluate().
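For example, here is a sketch of the distinction (the selectors are hypothetical): use Puppeteer's own input methods for anything that generates events, and reserve page.evaluate() for read-only inspection.
// Generates a trusted click event.
await this.page.click('button[type=submit]');

// Fine: page.evaluate() used only to inspect the page; no events are generated.
const listingCount = await this.page.evaluate(() => document.querySelectorAll('.listing').length);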
Prefer await super.getValues() or super.getValue()
Many scrapers implement code similar to this:
async oldVersionOfGetValues(selector, field) {
const returnVals = await this.page.evaluate((selector, field) => {
const vals = [];
const nodes = document.querySelectorAll(selector);
nodes.forEach(node => vals.push(node[field]));
return vals;
}, selector, field);
return returnVals;
}
After studying the Puppeteer documentation, I discovered that this can be replaced with a one-liner using page.$$eval:
async getValues(selector, field) {
return await this.page.$$eval(selector, (nodes, field) => nodes.map(node => node[field]), field);
}
This is used sufficiently often that it is now present in the Scraper.ts superclass. So, you should replace code similar to oldVersionOfGetValues with super.getValues().
If you are only looking for a single instance of the element, then use super.getValue(), which returns the element directly, not in a list.
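For example, here is a sketch of how these might be called (the selectors and fields are hypothetical, and I'm assuming super.getValue() takes the same (selector, field) arguments as super.getValues()):
// Collect the href of every listing link on the page.
const urls = await super.getValues('a.listing-link', 'href');

// Get the text of the (single) page title element.
const title = await super.getValue('h1.job-title', 'innerText');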
Page navigation patterns
There are two standard ways to navigate:
- Request a url directly.
- Click on a button or a link to navigate.
In the first case, you use super.goto(). For example:
await super.goto(this.url);
In the case of following a link or clicking a button, you need to use a more complicated piece of code:
await Promise.all([
this.page.click('input[class="c-button c-button--blue s-vgPadLeft1_5 s-vgPadRight1_5"]'),
this.page.waitForNavigation()
]);
This code ensures that both the click() and the waitForNavigation() complete before the script proceeds to the next command.
For more details, see https://pptr.dev/#?product=Puppeteer&version=v10.4.0&show=api-pagewaitfornavigationoptions.
Just to be clear: if you use super.goto(), you don't need to add page.waitForNavigation(). See https://stackoverflow.com/a/57881877/2038293 for details.
Prefer await super.selectorExists()
There are lots of situations in which you want to do some processing as long as there's at least one occurrence of a selector on the page. Because this is such a common pattern, the Scraper superclass provides a simple method for this:
/**
* Return true if the passed selector appears on the page.
*/
async selectorExists(selector) {
return !! await this.page.$(selector);
}
This makes loops like the following more readable:
const listingSelector = '#listing';
await this.page.goto(getUrl());
while (await super.selectorExists(listingSelector)) {
  // process listings on this page.
  await this.page.goto(getNextUrl());
}
Be kind to future you
"Future you" refers to you in several months when you have been working on other things, but have to come back to fix a broken scraper. Being kind of future you means structuring your code in such a way that it is easier to re-understand. Here are some tips:
Provide meaningful variable names to document "magic" strings
Consider the following line of code:
await this.page.waitForSelector('a[class="styles_component__1c6JC styles_defaultLink__1mFc1 styles_information__1TxGq"]');
What, precisely, are we waiting for? The problem here is that the meaning of this selector string is opaque: it doesn't tell us what the element is, where it might be, or why we might be waiting for it.
One good way to fix this is to assign that string to a variable whose name provides more information:
const internshipLink = 'a[class="styles_component__1c6JC styles_defaultLink__1mFc1 styles_information__1TxGq"]';
await this.page.waitForSelector(internshipLink);
A benefit of this approach over simply adding a comment string is that if you want to inspect the page manually using DevTools, you can simply copy-and-paste the line containing the variable definition into the DevTools console, which makes it easy to replicate the query using non-Puppeteer Dev Tools operations such as:
document.querySelector(internshipLink)
Avoid deep nesting
If your code has an if statement inside a while loop inside an if statement, for example, it will be hard to read.
In these cases, think about how to modularize your code. Maybe there is a block of code that can be refactored into a private method with a useful return value. That is useful for understanding, and also for debugging.
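For example, the body of a nested loop can often be pulled out into a method with a meaningful name and return value. Here is a sketch (processListing, the selectors, and this.listings are all hypothetical, not part of the Scraper superclass):
// Returns true if the listing was scraped, false if the page had no description.
async processListing(url) {
  await super.goto(url);
  if (!await super.selectorExists('#description')) {
    return false;
  }
  const description = await super.getValue('#description', 'innerText');
  this.listings.push({ url, description });
  return true;
}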
Don't use try-catch to provide "normal" control flow
Consider the following code:
try {
// Click the "Load More" button
await this.page.click('.load_more_jobs');
} catch (err) {
this.log.debug('--- All jobs are Listed, no "Load More" button --- ');
}
What is problematic about this code is that it is not an error for a page to not have a Load More button. So, the use of try-catch is not appropriate.
In this case, what is needed is to test whether or not the selector exists and only click it if so:
const loadJobsSelector = '.load_more_jobs';
if (await super.selectorExists(loadJobsSelector)) {
  await this.page.click(loadJobsSelector);
}
It doesn't seem particularly interesting to provide the debugging log statement, so I've omitted it, but you could add it back as an else clause if you really wanted it.
"Error: Navigation failed because browser has disconnected!"
Are you experiencing an intermittent error similar to this?
10:37:59 ERROR APPLE Execution context was destroyed, most likely because of a navigation.
(node:95928) UnhandledPromiseRejectionWarning: Error: Navigation failed because browser has disconnected!
at /Users/philipjohnson/github/internaloha/internaloha/scrapers-v2/node_modules/puppeteer/lib/cjs/puppeteer/common/LifecycleWatcher.js:51:147
According to this stackoverflow page, the "Navigation failed because browser has disconnected" error usually means that the node script that launched Puppeteer ended without waiting for the Puppeteer actions to complete.
The stackoverflow answer goes on to debug the specific code in question, but there is a much more general answer:
Be sure that you preface every Puppeteer operation (i.e. this.page.<operation>) with await.
For example, there was some scraper code that generated this error occasionally. On review, the following lines were discovered:
this.page.goto(pageUrl(++pageNum), {waitUntil: 'networkidle2'});
await this.page.waitForTimeout(3000);
Because the this.page.goto was not preceded by an await, that line of code returned immediately. The next line of code forced a wait of 3 seconds, which might or might not be enough time for the goto to complete successfully. If it is enough time, everything is OK. If it is not, we get the error.
The solution is to simply add the await, which also means we don't need the waitForTimeout:
await this.page.goto(pageUrl(++pageNum), {waitUntil: 'networkidle2'});
So, if you are getting this error intermittently, a quick thing to do is search for every occurrence of this.page in your scraper code and verify that each one is preceded by await.
How to determine if a page has finished loading
One issue in scraper design is ensuring that the scraper does not try to operate on a page until the page has loaded.
A common strategy in prior scraper implementations was to liberally insert pauses into script execution. For example, the following code pauses the script for 3 seconds each time it is executed:
await this.page.waitForTimeout(3000);
There are two problems with this approach:
- It is hard to figure out the appropriate length of time to pause. Is 3 seconds enough for all situations?
- It might slow down script execution significantly. If you insert a 3 second pause into a loop for each listing, and there are 500 listings, then you've just forced your script to require a minimum of 1500 seconds to execute.
Sometimes these pauses are inserted to mimic a human "speed" of page manipulation, but this is only needed for a few scrapers. More often, these pauses end up being inserted as a way to wait until the page is loaded.
This stackoverflow page has a number of comments regarding this issue. From them, we can get some hints about how best to wait until a page has loaded.
Case 1: When you know a selector will (eventually) exist on the page
If you are sure that a page, when finally loaded, will contain the selector of interest, then your best approach is to use page.waitForSelector(). The timeout is set to 0 globally by the Scraper superclass, which means this command will wait indefinitely for the selector to be present on the page.
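For example (the '.listing' selector is hypothetical):
// Wait (indefinitely, given the global timeout of 0) until at least one listing appears.
await this.page.waitForSelector('.listing');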
Case 2: When you don't know if a selector will (ever) exist on the page
If you are not sure that the selector of interest will exist, then things are more complicated, since you don't know if the absence of the selector is due to the selector not being present or the page not having completed loading.
First, it is important to understand that completing the "loading" process has two phases:
- Complete the downloading of all page resources (HTML, Javascript, Images, etc) from the server over the network.
- Complete the execution of all Javascript scripts on the page, since these scripts might create DOM elements.
To address (1), you can use the waitUntil option of commands like page.goto, as documented here. If you experience loading issues, then you might add this option with a value of networkidle0. Please note that you do not need to set timeout, as it is set to 0 globally by the scraper superclass.
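For example, here is a sketch of adding that option to a direct navigation (this.url stands in for whatever URL you are visiting):
// Resolve only after there have been no network connections for at least 500 ms.
await this.page.goto(this.url, { waitUntil: 'networkidle0' });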
In some cases, a page might have time-consuming Javascript scripts that execute. If you can verify that this is an issue on the site you are scraping, then you might want to consider the waitTillHTMLRendered function, documented in this stackoverflow answer.
Try using the Google Cache
If you are getting blocked by the site, see if you can scrape the Google cache version. Instructions are here. Basically, you just need to prepend "http://webcache.googleusercontent.com/search?q=cache:" to the beginning of the URL.
While this can result in a somewhat out-of-date version of the site, it's normally just a few days or weeks old, which is plenty recent enough for us.
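In code, that might look like the following sketch (assuming this.url holds the original site URL):
// Scrape Google's cached copy instead of the live site.
const cachedUrl = `http://webcache.googleusercontent.com/search?q=cache:${this.url}`;
await super.goto(cachedUrl);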
Add throttling and variation
Some sites monitor how fast requests come from a particular IP address, and block if they come in too fast or are too regular.
The slomo command line argument creates a lower bound on how fast the scraper issues HTTP requests. This is useful when developing the system with no-headless. However, it is often slower than necessary for production, and it does not add variation between navigations.
To prevent the scraper from traversing pages too quickly, the super.goto() method provides a drop-in replacement for this.page.goto() that adds a random delay between 1 second and super.maxRandomWait milliseconds (which currently defaults to 5000). Use this method to automatically throttle your scraper's frequency of page requests and to add some variation. You can change the maxRandomWait variable in your constructor if you determine you need a different level of variability.
However, this approach won't work if you navigate using this.page.click(). In that case, you will have to manually insert a pause. To make it easier to wait a random amount of time, you can use super.randomWait().
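For example, here is a sketch of a click-based navigation followed by a random pause (nextPageButton is a hypothetical selector, and I'm assuming super.randomWait() can be called with no arguments; check its signature in Scraper.ts):
await Promise.all([
  this.page.click(nextPageButton),
  this.page.waitForNavigation(),
]);
// Manually add the throttling that super.goto() would otherwise provide.
await super.randomWait();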