The scale of the data has the largest impact on the accuracy of the pipelines.
This post focuses on three aspects of working with data at this scale:
Distributed systems
Performance
Consistency of data
Signal-to-noise ratio (S/N)
The signal-to-noise ratio measures how much of the data is useful relative to the data that is not. For example, in your e-mail, the useful messages (the signal) are few compared to the noise (spam, irrelevant information). In a pipeline, the volume of data is large, and the noise must be carefully removed (i.e., data cleaning: the removal of irrelevant data).
There are many definitions of the S/N ratio, but for large datasets it is convenient to express it in decibels (dB). For example, a web scraper might receive 10,000 HTTP responses per hour, or over 1.5 TB per day. The S/N ratio of the scraper is then the volume of useful data extracted from those responses divided by the volume of everything else in that 1.5 TB.
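To make the decibel form concrete, here is a minimal sketch that computes the S/N ratio from byte counts, both as a plain ratio and in dB. It assumes the common 10·log10 convention for ratios of quantities. The function names and the sample figure of 15 GB of useful data are hypothetical, chosen only for illustration; they are not measurements from the scraper described above.

```python
import math


def snr(signal_bytes: float, noise_bytes: float) -> float:
    """Plain ratio of useful data volume to non-useful data volume."""
    if signal_bytes < 0:
        raise ValueError("signal_bytes must be non-negative")
    if noise_bytes <= 0:
        raise ValueError("noise_bytes must be positive")
    return signal_bytes / noise_bytes


def snr_db(signal_bytes: float, noise_bytes: float) -> float:
    """S/N ratio in decibels, assuming the 10 * log10(ratio) convention."""
    ratio = snr(signal_bytes, noise_bytes)
    if ratio <= 0:
        raise ValueError("the ratio must be positive to take a logarithm")
    return 10 * math.log10(ratio)


# Hypothetical day of scraping: 1.5 TB received, of which 15 GB is useful.
total_bytes = 1.5e12
signal_bytes = 1.5e10          # assumed value, not from the article
noise_bytes = total_bytes - signal_bytes

print(f"S/N ratio: {snr(signal_bytes, noise_bytes):.4f}")
print(f"S/N in dB: {snr_db(signal_bytes, noise_bytes):.2f}")
```

A negative dB value means the noise outweighs the signal, which is the normal situation for raw scraped data.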
The signal-to-noise ratio also lets us estimate the number of machines needed to crawl the internet and collect a given amount of useful data.
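One way such an estimate could work is sketched below. This is an illustrative back-of-the-envelope calculation, not a method taken from the article: it assumes that the raw volume to crawl equals the desired useful volume divided by the useful fraction, and that each machine handles a fixed raw volume per day. All names and sample numbers are hypothetical.

```python
import math


def signal_fraction_from_snr(ratio: float) -> float:
    """Convert S/N (signal divided by noise) into the useful share of all data."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return ratio / (1.0 + ratio)


def machines_needed(target_signal_bytes_per_day: float,
                    signal_fraction: float,
                    per_machine_bytes_per_day: float) -> int:
    """Estimate how many machines are required to collect the target signal."""
    if not 0 < signal_fraction <= 1:
        raise ValueError("signal_fraction must be in (0, 1]")
    if per_machine_bytes_per_day <= 0:
        raise ValueError("per_machine_bytes_per_day must be positive")
    # Raw bytes that must be crawled to obtain the desired useful bytes.
    raw_bytes_per_day = target_signal_bytes_per_day / signal_fraction
    return math.ceil(raw_bytes_per_day / per_machine_bytes_per_day)


# Hypothetical inputs: we want 40 GB of useful data per day, 1% of crawled
# data is useful, and one machine processes 1.5 TB of raw data per day.
print(machines_needed(4e10, 0.01, 1.5e12))
```

The lower the useful fraction, the more raw data has to be crawled for the same amount of signal, so the machine count grows as the S/N ratio falls.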
This article introduces three basic solutions for reducing the size of the data through discretization (the principle of point processing: uniquely identifying data by means of point processing).