## Invocation

### `npm run scrape -- -s <scraper>`

This is the simplest version of the script, which runs a single scraper. For example:
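A run of the NSF scraper (whose output files are described below) might look like this; the `nsf` scraper name is used here only for illustration:

```shell
# Run a single scraper (the "nsf" scraper name is an example).
npm run scrape -- -s nsf
```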
Currently, this command produces the following output:
You will see that a file called `nsf.dev.json` has been written to the `listings/compsci` directory, and a file called (for example) `nsf-2021-10-18.dev.json` has been written to the `statistics/compsci` directory.
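The naming convention for these output files can be sketched in shell. This is an illustrative reconstruction of the convention described above, not the project's actual code; the `nsf` scraper name is just an example:

```shell
# Sketch of the output-file naming convention described above.
scraper=nsf
discipline=compsci
run_date=$(date +%Y-%m-%d)

# Listing file: one per scraper, overwritten on each run.
listing_file="listings/${discipline}/${scraper}.dev.json"

# Statistics file: one per scraper per day (YYYY-MM-DD timestamp).
stats_file="statistics/${discipline}/${scraper}-${run_date}.dev.json"

echo "$listing_file"
echo "$stats_file"
```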
### `npm run scrape -- --help`

There are many options for customizing the run of a scraper. To see them, invoke help:
Here is the output from a recent run. There may be additional options or changes in your version.
You can provide any combination of these parameters, in any order. The only required parameter is the scraper.
### Why no multi-scraper invocation?

In the previous version of the scraper, we discovered that puppeteer is not "thread safe": running multiple scrapers simultaneously can result in execution errors that do not appear when each scraper is run individually.
To avoid this problem, the `scrape` script supports running only a single scraper at a time. To support batch execution of multiple scrapers, we have created a Unix shell script (`run-scrapers.sh`) that invokes the `scrape` script multiple times, once per scraper. This isolates each run of a scraper in its own OS process and prevents these sorts of errors from occurring. We will eventually create a Windows version of this script.
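A minimal sketch of what `run-scrapers.sh` might look like under this one-process-per-scraper design (the scraper names in the loop are assumptions for illustration; the actual script may differ):

```shell
#!/bin/bash
# Run each scraper via its own "npm run scrape" invocation, so each
# puppeteer instance lives in a separate OS process.
# (The scraper names below are illustrative, not the actual list.)
for scraper in nsf acm; do
  npm run scrape -- -s "$scraper"
done
```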
### Example: NSF Scraper

Here is the default run of the NSF scraper. The log level defaults to 'info', so there is very little output:

Running the scraper with log level 'debug' produces a lot of output, much of which is elided here:
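These two runs might be invoked as follows. The log-level flag shown here is an assumption for illustration; consult `--help` output for the actual option name:

```shell
# Default run: log level 'info', minimal output.
npm run scrape -- -s nsf

# Verbose run (flag name assumed for illustration).
npm run scrape -- -s nsf --log-level debug
```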
### Multi-discipline support

The `scrape` script provides a `--discipline` parameter that defaults to "compsci" but also supports "compeng". The value of this parameter is available to each scraper in a field called `discipline`. Each scraper can consult the value of this field and alter its search behavior if it wants to implement discipline-specific internship listings.
The `--discipline` parameter also determines the directories where the listing and statistics files are written: the compsci files are written into `listings/compsci` and `statistics/compsci`, and the compeng files are written into `listings/compeng` and `statistics/compeng`.
At this time, the scrapers do not change their behavior according to the value of the `--discipline` parameter. So, if you call the `scrape` script with `--discipline compeng`, the only impact is that the listing and statistics files are written to a different subdirectory.
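For instance, a compeng run might look like this (the `nsf` scraper name is just an example):

```shell
# Output goes to listings/compeng and statistics/compeng.
npm run scrape -- -s nsf --discipline compeng
```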
### Generating statistics

Each time you run a scraper, a JSON file containing information about that run is written to a subdirectory of `statistics/`. The file name contains a YYYY-MM-DD timestamp, so statistics are only maintained for the last run of each day.

For example, here are the contents of `statistics/compsci/nsf-2021-10-08.dev.json`:
You can run the `statistics` script to read all of the files in the statistics directory and generate a set of CSV files that provide historical trends for the scrapers:

During development, statistics files are generated with a `.dev.csv` extension. This means you can look at them, but they are not committed to GitHub. Running the statistics script with the `--commit-files` flag eliminates the `.dev` suffix component, and thus allows the statistics files to be committed to GitHub. You can browse those files directly to obtain a usable tabular representation of the run data.
Invoke the statistics script with the `--help` option to see all available options:
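Assuming the script is wired up as an npm script named `statistics` (as the invocation style above suggests), that would be:

```shell
npm run statistics -- --help
```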
### Development mode (don't commit output files)

During development, people will be running scrapers and generating both listing and statistics "output" files in their branches. This could lead to many spurious merge conflicts when merging those branches back into main.
To avoid this problem, both the `scrape` and `statistics` scripts have a `--commit-files` flag that is (currently) false by default. When false, all listing file names have a `.dev.json` suffix and all statistics file names have a `.dev.csv` suffix. Both suffixes are git-ignored, so the output files you create during development are not committed to your branch or to main.

If you want your data files to be committed, simply run either script with the `--commit-files` option, which sets that flag to true. The associated output files are then created with `.json` (rather than `.dev.json`) or `.csv` (rather than `.dev.csv`) extensions, and so they will not be git-ignored.
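For example, commit-worthy runs of both scripts might look like this (the `nsf` scraper name is an example, and the `statistics` npm script name is assumed from the invocation style above):

```shell
# Output files get .json / .csv names (no .dev suffix), so git will track them.
npm run scrape -- -s nsf --commit-files
npm run statistics -- --commit-files
```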