In this section, we describe our workflow for extracting data from the FCC filings: our pre- and post-processing steps and the data extraction performed by Rossum.ai.
Accuracy of our Document Extraction
The data extraction by Rossum reaches a dollar-weighted accuracy of 93.9% in identifying the correct gross total amount of a document.
After our manual correction through the Rossum User Interface, we know the start or end of the flight of a TV advertisement for over 81% of all dollars spent.
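To make the term "dollar-weighted" concrete, here is a minimal sketch. It assumes that each document is weighted by its true gross total and counts as correct only when the extracted amount matches to the cent; the function name and the sample figures are illustrative and do not come from our pipeline.

```python
def dollar_weighted_accuracy(documents):
    """Return the share of all dollars that sits in correctly extracted documents.

    `documents` is an iterable of (true_gross_total, extracted_gross_total)
    pairs. Weighting by the true amount is an assumption of this sketch.
    """
    total_dollars = 0.0
    correct_dollars = 0.0
    for true_amount, extracted_amount in documents:
        total_dollars += true_amount
        # A document counts as correct only if the amounts agree to the cent.
        if round(true_amount, 2) == round(extracted_amount, 2):
            correct_dollars += true_amount
    return correct_dollars / total_dollars if total_dollars else 0.0


# Illustrative data: the small middle document was extracted incorrectly.
sample = [(10000.0, 10000.0), (500.0, 50.0), (1500.0, 1500.0)]
weighted = dollar_weighted_accuracy(sample)
unweighted = sum(1 for t, e in sample if t == e) / len(sample)
```

In this sample, two of three documents are correct, but the incorrect one is small, so the dollar-weighted figure is higher than the plain document-level figure. An error on a large buy would pull it down correspondingly harder.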
Our outlier detection safely identifies SELECT
Our AI-based preprocessing (using a Random Forest classifier)
- determines the rotation angle and aligns all FCC documents vertically,
- splits all FCC PDF files into single documents (orders, attachments, invoices, and others)
with an accuracy of 96%.
Rossum.ai offers an AI-powered data extraction engine, and we partner with Rossum to process the FCC filings.
We have manually annotated 2000 FCC documents to help Rossum train a data extraction model custom-fit to the FCC documents uploaded into the Political File: orders, contracts, invoices, order worksheets, e-orders, and others.
The Rossum User Interface is a highly useful tool for both annotating the training data and correcting already extracted data.
During post-processing, we identify incorrectly extracted fields and, where possible, correct the data automatically using a set of rule-based algorithms. If no automatic correction is possible, we flag the suspicious extractions and re-upload the documents with the highest dollar impact to Rossum for manual correction.
Here are some examples of how this step is performed:
- Gross total, net total, and agency commission follow the equation gross total = net total + agency commission. If the extracted values do not satisfy this equation, the correct amount can be calculated from those that match.
- The agency commission is usually 15% or 0% and will never be 100% or above.
- For each callsign, a pattern of frequent spending amounts is calculated. We detect deviations from these patterns and double-check the affected data for correctness.
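The first two rules can be combined into a simple repair step. The following is a minimal sketch, not our production code. It assumes that the commission rate is measured against the gross total, that amounts may differ by up to one cent due to rounding, and that a correction is accepted only when exactly one candidate yields a usual commission rate. The names `repair_totals`, `KNOWN_RATES`, and `TOLERANCE` are hypothetical.

```python
KNOWN_RATES = (0.15, 0.0)  # commission rates described as usual in the text
TOLERANCE = 0.01           # assumed rounding tolerance in dollars


def is_consistent(gross, net, commission):
    """Check gross total = net total + agency commission, up to rounding."""
    return abs(gross - (net + commission)) <= TOLERANCE


def has_usual_rate(gross, net, commission):
    """Check that all amounts are sane and the commission rate is a usual one."""
    if gross <= 0 or net < 0 or commission < 0:
        return False
    rate = commission / gross
    return any(abs(rate - known) <= 0.001 for known in KNOWN_RATES)


def repair_totals(gross, net, commission):
    """Return a corrected (gross, net, commission) triple, or None.

    None means no unambiguous automatic fix exists, so the document
    should be flagged for manual review.
    """
    if is_consistent(gross, net, commission):
        return (gross, net, commission)
    # Recompute each field in turn from the other two.
    candidates = [
        (net + commission, net, commission),  # gross was wrong
        (gross, gross - commission, commission),  # net was wrong
        (gross, net, gross - net),  # commission was wrong
    ]
    plausible = [c for c in candidates if has_usual_rate(*c)]
    return plausible[0] if len(plausible) == 1 else None


fixed = repair_totals(100.0, 850.0, 150.0)       # gross lost a digit
ambiguous = repair_totals(1000.0, 1000.0, 150.0)  # two fixes are plausible
```

In the first call, only the recomputed gross total gives a 15% commission, so the triple is repaired. In the second call, both a corrected net total (15% commission) and a corrected commission (0%) are plausible, so the sketch returns `None` rather than guessing.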
In our deduplication step, we match multiple documents that refer to the same advertising buy: e.g., orders, order worksheets, contracts, and invoices can refer to the same spot. Frequently, multiple revisions of a contract are uploaded. We detect the connection between those documents using the invoice IDs and reference IDs that are extracted from them.
Deduplication helps to remedy incorrectly extracted data or missed extractions. For example:
- Orders, contracts, or invoices that refer to the same buy are simultaneously used to learn all fields even if they are missing on some documents.
- Correct values are determined by majority vote.
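The two points above can be sketched as follows. This is an illustration under simplifying assumptions: every document is a dictionary, all documents of one buy share a single hypothetical `reference_id` key, and ties in the vote are resolved in favor of the value seen first. The field names and sample values are made up for the example.

```python
from collections import Counter, defaultdict


def merge_by_reference(documents):
    """Group documents by reference ID and merge their fields by majority vote."""
    groups = defaultdict(list)
    for doc in documents:
        groups[doc["reference_id"]].append(doc)

    merged = {}
    for ref_id, docs in groups.items():
        # Collect every field that appears on at least one document of the buy.
        fields = {key for doc in docs for key in doc if key != "reference_id"}
        record = {}
        for field in sorted(fields):
            # Ignore documents where the field is missing or was not extracted.
            values = [doc[field] for doc in docs if doc.get(field) is not None]
            if values:
                # most_common(1) returns the most frequent value; on a tie
                # it returns the value that was encountered first.
                record[field] = Counter(values).most_common(1)[0][0]
        merged[ref_id] = record
    return merged


documents = [
    {"reference_id": "R-1", "gross_total": 1000.0, "flight_start": "2020-10-01"},
    {"reference_id": "R-1", "gross_total": 1000.0, "flight_start": None},
    {"reference_id": "R-1", "gross_total": 100.0, "flight_end": "2020-10-15"},
    {"reference_id": "R-2", "gross_total": 250.0},
]
merged = merge_by_reference(documents)
```

For the buy `R-1`, the merged record takes the gross total that two of three documents agree on and fills in the flight start and end dates from whichever document carries them.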