public class ScrapeState extends Object
| Constructor and Description |
|---|
| ScrapeState(List<CrawlRecord> pagesToBeScraped) |
| Modifier and Type | Method and Description |
|---|---|
| void | addFailedToScrapeURL(CrawlRecord record)<br>Adds the given CrawlRecord to the list of CrawlRecords NOT successfully scraped. |
| void | addNquads(String key, String nquads) |
| void | addSuccessfulScrapedURL(CrawlRecord record)<br>Adds the given CrawlRecord to the list of CrawlRecords successfully scraped. |
| Map<String,Object> | getNquadsConcurrentHashMap() |
| int | getNumberPagesLeftToScrape()<br>Returns the number of URLs that are still to be scraped in this cycle. |
| List<CrawlRecord> | getPagesProcessed()<br>Gets the full list of URLs that have been processed in this cycle. |
| List<CrawlRecord> | getPagesProcessedAndUnprocessed()<br>Gets the full list of URLs/CrawlRecords regardless of whether scraped or not in the current cycle. |
| CrawlRecord | getURLToProcess()<br>Returns the next URL/CrawlRecord to be scraped. |
| boolean | pagesLeftToScrape()<br>Any pages/URLs left to scrape? |
| void | setStatusTo404(CrawlRecord record)<br>Changes the status of the CrawlRecord to DOES_NOT_EXIST. |
| void | setStatusToHumanInspection(CrawlRecord record)<br>Changes the status of the CrawlRecord to HUMAN_INSPECTION. |
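The method names above suggest a work-queue pattern: check pagesLeftToScrape(), take a record with getURLToProcess(), then report the outcome with addSuccessfulScrapedURL or addFailedToScrapeURL. The following is only a minimal sketch of that pattern, not the actual implementation. `Record` and `State` are hypothetical stand-ins for CrawlRecord and ScrapeState, whose internals this page does not show. In particular, the sketch assumes that getURLToProcess() removes the record from the pending queue and returns null when the queue is empty, and that the methods are synchronized; none of this is confirmed by the documentation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.function.Predicate;

public class ScrapeLoopSketch {

    /** Hypothetical stand-in for CrawlRecord; only carries a URL. */
    static final class Record {
        final String url;
        Record(String url) { this.url = url; }
    }

    /** Hypothetical stand-in that mirrors a few ScrapeState method names. */
    static final class State {
        private final Deque<Record> pending;
        private final List<Record> succeeded = new ArrayList<>();
        private final List<Record> failed = new ArrayList<>();

        State(List<Record> pagesToBeScraped) {
            this.pending = new ArrayDeque<>(pagesToBeScraped);
        }

        synchronized boolean pagesLeftToScrape() { return !pending.isEmpty(); }

        // Assumption: taking a record removes it from the pending queue.
        synchronized Record getURLToProcess() { return pending.poll(); }

        synchronized void addSuccessfulScrapedURL(Record record) { succeeded.add(record); }

        synchronized void addFailedToScrapeURL(Record record) { failed.add(record); }

        synchronized int getNumberPagesLeftToScrape() { return pending.size(); }
    }

    /**
     * Processes every pending record with the given scraper and
     * returns {number succeeded, number failed}.
     */
    static int[] drain(State state, Predicate<Record> scraper) {
        while (state.pagesLeftToScrape()) {
            Record next = state.getURLToProcess();
            if (next == null) {
                break; // another worker may have taken the last record
            }
            if (scraper.test(next)) {
                state.addSuccessfulScrapedURL(next);
            } else {
                state.addFailedToScrapeURL(next);
            }
        }
        return new int[] { state.succeeded.size(), state.failed.size() };
    }

    public static void main(String[] args) {
        State state = new State(Arrays.asList(
                new Record("https://example.org/a"),
                new Record("https://example.org/broken"),
                new Record("https://example.org/c")));
        // Toy scraper: treat any URL ending in "/broken" as a failure.
        int[] counts = drain(state, r -> !r.url.endsWith("/broken"));
        System.out.println(counts[0] + " scraped, " + counts[1] + " failed, "
                + state.getNumberPagesLeftToScrape() + " left");
    }
}
```

The null check after getURLToProcess() matters only when several workers share one state object: the queue can empty between the pagesLeftToScrape() check and the take.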
public ScrapeState(List<CrawlRecord> pagesToBeScraped)
Parameters: pagesToBeScraped - The list of sites to be scraped
See Also: ScrapeThread, CrawlRecord

public boolean pagesLeftToScrape()
See Also: CrawlRecord

public CrawlRecord getURLToProcess()
See Also: CrawlRecord

public void addSuccessfulScrapedURL(CrawlRecord record)
Parameters: record - The latest URL/page that has been successfully scraped
See Also: CrawlRecord

public void addFailedToScrapeURL(CrawlRecord record)
Parameters: record - The latest URL/page that has been unsuccessfully scraped
See Also: CrawlRecord

public void setStatusTo404(CrawlRecord record)
Parameters: record - The latest URL/page that has been 404'd
See Also: CrawlRecord

public void setStatusToHumanInspection(CrawlRecord record)
Parameters: record - The latest URL/page that needs human inspection
See Also: CrawlRecord

public int getNumberPagesLeftToScrape()
See Also: CrawlRecord

public List<CrawlRecord> getPagesProcessed()
See Also: CrawlRecord

public List<CrawlRecord> getPagesProcessedAndUnprocessed()
See Also: CrawlRecord

Copyright © 2025. All rights reserved.