Signals
Get to know Signals field concepts and output data formats
Signals are a unique type of field which leverage previously extracted data to derive aggregated metrics or values. Depending on the signal type, you may compute statistics of your own document data set or a shared pool of aggregated information.
Sample use-cases for signal field types include:
Comparing the Total on an invoice to the historical average to detect anomalies
Detecting document duplicates by searching for old data matching key fields
Computing the probability of observed fields combinations to detect potential fraud
Signals are currently available via private Beta.
If you'd like to get started with signals on your data, please reach out!
Signals come in two forms:
Pre-configured signal models for a specific use-case, e.g. fraud detection
Custom signals configured manually
Probability models compute the probability of observing certain field values with respect to others that have been extracted from the document. This type of field can detect documents with unusual details and forms the basis of the fraud signal field provided by Sypht. The Sypht fraud signal field can detect common invoice fraud where a third party replaces legitimate payment information for a known invoice issuer with their own payment details. These new details will not match historical payment information for this issuer and will therefore return a low probability of being legitimate.
In the example below, we calculate a fraud signal based on the probability of the observed bank account details (BSB and accountNo) appearing given the Australian Business Number (ABN).
We define two sets of fields to compute a probability or fraud detection signal:
conditioned fields: the set of fields whose values are common to all documents and form the underlying set of documents from which we can derive probability estimates for the observed field values. In the above example this is the ABN of the issuer.
observed fields: the set of fields being observed from which we derive a probability estimate given the conditioned field values being present. In the example above these are the payment details (BSB and accountNo) that could be altered in a fraudulent document.
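As a rough illustration, such a signal could be described by a configuration like the following; the structure, keys, and field IDs here are hypothetical, not Sypht's documented schema:

```python
# Hypothetical configuration for a probability/fraud signal.
# All keys and field IDs are illustrative only.
fraud_signal_config = {
    "type": "probability",
    "conditioned": ["invoice.supplierABN"],           # common across documents
    "observed": ["invoice.bsb", "invoice.accountNo"],  # values being checked
}
```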
The fraud signal field will calculate a float value indicating the probability P(observed|conditioned) computed from historical data. Using the above example this is: "the probability of a document with a certain ABN also having the observed BSB and accountNo, with respect to all other documents that have the same ABN". The complement of this probability, 1 - P(observed|conditioned), is computed and rounded to the nearest 0.05 to give the final signal value as the likelihood of fraud. (Note: the probability value is smoothed.)
For example, if there are 1000 documents with a certain ABN and only 50 of those documents have a given BSB and accountNo, then the signal will be 1 - 50/1000 = 0.95 (without smoothing). This is rounded to the nearest 0.05, giving 0.95 (no change). The value may be displayed as a percentage, in this case 95%. A signal above 70% would be considered strong enough to warrant inspection.
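Ignoring smoothing, the worked example above can be sketched as:

```python
def fraud_signal(matching_docs: int, total_docs: int) -> float:
    """Unsmoothed sketch of the fraud signal: the complement of
    P(observed|conditioned), rounded to the nearest 0.05.
    (The production signal also applies smoothing, not shown here.)"""
    probability = matching_docs / total_docs
    return round((1 - probability) * 20) / 20  # round to nearest 0.05


# 50 of 1000 documents with this ABN share the observed BSB/accountNo
print(fraud_signal(50, 1000))  # → 0.95
```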
It's also important to consider the confidence of the signal, which accounts for the number of documents from which the probability calculation is derived. If there is one document, the signal will be 0%; for 2 documents the signal will be 35% (with smoothing included). In these situations there is clearly not enough data to determine whether a document is unusual or fraudulent, so the confidence will be very low to reflect this. See the next section for how the confidence of the signal is calculated.
The signal output includes a confidence field that takes a value from 0 to 1 indicating the degree of overall confidence in the signal. This value is a function of:
The mean of the confidence values given by the models that extracted the observed and conditioned fields of the document the signal is used on. Lower confidence in the extracted values (the ABN, BSB or accountNo using the above example) will result in a lower value. For example, if the ABN, BSB and accountNo have confidence values of 0.8809, 0.9998 and 0.9995, the mean will be 0.9601.
The total number of documents used when calculating the probability, which is the number of documents with the given conditioned field values. Using the above example, this would be all documents with the same ABN. The smaller this number, the more the confidence value calculated in (1) is reduced.
No reduction in confidence is made if there are 1000 or more documents.
For 500 documents, the confidence is reduced by ~10%.
For 100 documents, the confidence is reduced by ~33%.
For 10 documents, the confidence is reduced by ~67%.
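One simple function consistent with the reductions listed above is a log-scale factor. The sketch below is an assumed reconstruction, not the documented formula; it merely reproduces the stated data points:

```python
import math


def signal_confidence(field_confidences, n_docs):
    """Assumed sketch: mean extraction confidence scaled by a log10 factor
    that reproduces the documented reductions (~10% at 500 documents,
    ~33% at 100, ~67% at 10, and none at 1000 or more)."""
    mean_conf = sum(field_confidences) / len(field_confidences)
    scale = min(1.0, math.log10(max(n_docs, 1)) / 3)
    return mean_conf * scale


# The worked example: mean 0.9601 with 1000+ documents → no reduction
print(signal_confidence([0.8809, 0.9998, 0.9995], 1000))
```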
The signal output includes a support field that can take on 3 values to provide an indicator of the total number of documents used in calculating the signal:
HIGH - the estimate is based on enough documents that we can be confident about the probability.
MEDIUM - there are enough documents that the probability signal can be treated as an indicator.
LOW - there might not be enough documents for the probability calculation to provide an accurate indicator.
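The mapping from document count to the support indicator might look like the sketch below; the thresholds are assumptions for illustration, not documented values:

```python
def support_level(n_docs: int, high: int = 1000, medium: int = 100) -> str:
    """Map the number of supporting documents to the support indicator.
    The 1000/100 thresholds are illustrative assumptions."""
    if n_docs >= high:
        return "HIGH"
    if n_docs >= medium:
        return "MEDIUM"
    return "LOW"
```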
Document match models dynamically search and match previously uploaded documents based on values extracted from a query document. This can be used to power a 3-way match of invoice to purchase and delivery documentation; or, in the example below, to detect near-duplicates and avoid the same invoice being processed multiple times:
fields
a list of field IDs to match against
exact
boolean value indicating whether to use fuzzy or exact value matching
Document match searches all previously uploaded documents where the match fields have been extracted.
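Exact versus fuzzy matching of a field value could look something like this sketch; the similarity measure and threshold are assumptions, not Sypht's actual matching logic:

```python
import difflib


def values_match(a: str, b: str, exact: bool = True,
                 threshold: float = 0.8) -> bool:
    """Compare two field values either exactly or by a fuzzy similarity
    ratio (assumed measure and threshold, for illustration only)."""
    if exact:
        return a == b
    return difflib.SequenceMatcher(None, a, b).ratio() >= threshold


print(values_match("INV-001", "INV-O01"))               # exact → False
print(values_match("INV-001", "INV-O01", exact=False))  # fuzzy → True
```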
A list of documentreference values, one for each matched document.
For example:
Statistical signals return basic metrics for numerical field values with respect to historical data. This can be used for data analysis and benchmarking of extractions against historical values or market data precedents.
source
the field identifier to compute historical metrics for. Only fields with numerical data types are accepted.
A dictionary containing descriptive metrics including:
min
the minimum observed value of this field in previously extracted data
max
the maximum observed value of this field in previously extracted data
avg
the average or mean observed value for this field in previously extracted data
variance
the numerical variance of this field in previously extracted data
percentile_rank
the percentile rank of the observed value observed with respect to previously extracted data
For example, by configuring a value statistic signal over the invoice.total field, you may assess the percentile_rank of the Total on an invoice and use this to dynamically flag invoices with unusually high values (e.g. 99th percentile) for human review.
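The metrics above can be sketched over a set of historical values as follows; this is a minimal illustration of the documented output, not Sypht's implementation:

```python
def value_statistics(history, observed):
    """Descriptive metrics mirroring the documented signal output,
    computed over previously extracted numerical values."""
    n = len(history)
    avg = sum(history) / n
    return {
        "min": min(history),
        "max": max(history),
        "avg": avg,
        "variance": sum((x - avg) ** 2 for x in history) / n,
        "percentile_rank": 100 * sum(1 for x in history if x <= observed) / n,
    }


# An invoice total of 400 assessed against four historical totals
print(value_statistics([100, 200, 300, 400], 400))
```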