The investment world has a major problem when it comes to data on small and medium-sized enterprises (SMEs). It has nothing to do with data quality or accuracy; it's the lack of any data at all.
Assessing SME creditworthiness has historically been difficult because small businesses' financial statements are not public and are therefore hard to access.
S&P Global Market Intelligence, a division of S&P Global and a provider of credit ratings, claims to have solved this long-standing problem. The company's technical team built RiskGauge, an AI-powered platform that crawls data from more than 200 million websites, processes it through numerous algorithms and generates risk scores.
Built on Snowflake's architecture, the platform has increased S&P's SME coverage by 5X.
"Our goal was to expand coverage and do so efficiently," said Moody Hadi, head of new risk solutions at S&P Global. "The project has improved the accuracy and coverage of the data, benefiting clients."
RiskGauge's underlying architecture
Hadi's department assesses company creditworthiness and risk based on several factors, including financial statements, probability of default and risk appetite. S&P Global Market Intelligence provides these insights to institutional investors, banks, insurance companies, wealth managers and others.
"Large corporate and financial entities lend to suppliers, but they need to know how much to lend, how frequently to monitor them and what the duration of the loan will be," Hadi explained. "They rely on third parties to come up with a trustworthy credit score."
But there has always been a gap in SME coverage. Hadi pointed out that, while large public companies such as IBM, Microsoft, Amazon and Google are required to disclose their quarterly financial statements, SMEs have no such obligation, which limits financial transparency. From an investor's perspective, consider that there are roughly 10 million SMEs in the U.S., compared to about 60,000 public companies.
S&P Global Market Intelligence claims it now has them covered: Previously, the firm only had data on about 2 million SMEs, but RiskGauge has expanded that to 10 million.
The platform, which went into production in January, is based on a system built by Hadi's team that pulls firmographic data from unstructured web content, combines it with anonymized third-party datasets and applies machine learning (ML) and advanced algorithms to generate credit scores.
The company uses Snowflake to mine company pages and process them into firmographic drivers (market segmenters) that are then fed into RiskGauge.
The system's data pipeline consists of:
- Crawlers/web scrapers
- A pre-processing layer
- Miners
- Curators
- Risk scoring
Specifically, Hadi's team uses Snowflake's data warehouse and Snowpark Container Services across the pre-processing, mining and curation steps.
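The article doesn't publish S&P's code, but the staged flow it describes can be sketched in a few lines of Python. Everything below (the CompanyRecord type, the stage functions, the sample domain) is a hypothetical illustration of a crawl-to-score pipeline, not the actual implementation.

```python
# Minimal sketch of a staged SME-scoring pipeline (hypothetical names and logic).
# Each stage enriches a shared record: crawl -> pre-process -> mine -> curate -> score.
from dataclasses import dataclass, field


@dataclass
class CompanyRecord:
    domain: str
    raw_pages: dict = field(default_factory=dict)   # url -> raw HTML
    clean_text: dict = field(default_factory=dict)  # url -> text only
    attributes: dict = field(default_factory=dict)  # mined firmographics
    score: int = 0                                  # 1 (best) .. 100 (worst)


def crawl(rec: CompanyRecord) -> CompanyRecord:
    # Placeholder: a real crawler would fetch pages from rec.domain.
    rec.raw_pages["/contact-us"] = "<html><body>Acme Corp, Austin TX</body></html>"
    return rec


def preprocess(rec: CompanyRecord) -> CompanyRecord:
    # Strip markup so downstream miners see human-readable text only.
    for url, html in rec.raw_pages.items():
        rec.clean_text[url] = html.replace("<html><body>", "").replace("</body></html>", "")
    return rec


def mine(rec: CompanyRecord) -> CompanyRecord:
    # Extract candidate firmographics (name, location, line of business, ...).
    text = " ".join(rec.clean_text.values())
    rec.attributes["name"] = text.split(",")[0]
    return rec


def curate(rec: CompanyRecord) -> CompanyRecord:
    # Validation and deduplication would happen here before scoring.
    return rec


def score(rec: CompanyRecord) -> CompanyRecord:
    # Stand-in for the proprietary risk model: lower means less risky.
    rec.score = 42
    return rec


if __name__ == "__main__":
    record = CompanyRecord(domain="acme.example")
    for stage in (crawl, preprocess, mine, curate, score):
        record = stage(record)
    print(record.attributes, record.score)
```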
At the end of the process, SMEs are scored based on a combination of financial, business and market risk, with 1 being the highest and 100 the lowest. Investors also receive RiskGauge reports detailing financials, firmographics, business credit reports, historical performance and key developments. They can also compare companies to their peers.
How S&P collects valuable company data
Hadi explained that RiskGauge employs a multi-layered scraping process that pulls different details from a company's web domain, such as basic "contact us" pages and news-related information. The miners then drill down several URL levels to detect relevant data.
"As you can imagine, a human can't do this," said Hadi. "It would be extremely time-consuming, especially when you're dealing with 200 million web pages." That process, he noted, yields several terabytes of website information.
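One way a miner could narrow 200 million pages down to the useful ones is to shortlist URLs whose paths hint at contact, about or news content before scraping deeper. The keyword list and URLs below are invented for illustration; they are not S&P's actual rules.

```python
# Hypothetical URL shortlisting: keep only pages whose paths suggest
# contact, about, news or press content for deeper scraping.
from urllib.parse import urlparse

RELEVANT_HINTS = ("contact", "about", "news", "press")


def shortlist(urls: list) -> list:
    keep = []
    for url in urls:
        path = urlparse(url).path.lower()
        if any(hint in path for hint in RELEVANT_HINTS):
            keep.append(url)
    return keep


print(shortlist([
    "https://acme.example/contact-us",
    "https://acme.example/careers",
    "https://acme.example/news/2025/expansion",
]))
# -> ['https://acme.example/contact-us', 'https://acme.example/news/2025/expansion']
```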
After the data is collected, the next step is to run algorithms that remove anything that isn't text; Hadi noted that the system is not interested in JavaScript or even HTML tags. The data is cleaned up so that it becomes human-readable, not code. Then it's loaded into Snowflake and several data miners are run against the pages.
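A minimal sketch of that "keep only human-readable text" step, assuming a library like BeautifulSoup is used (the article doesn't say which tooling S&P chose): drop script and style blocks, strip all markup, keep the plain text.

```python
# Strip code and markup from a crawled page, keeping only readable text.
from bs4 import BeautifulSoup


def html_to_text(raw_html: str) -> str:
    soup = BeautifulSoup(raw_html, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()  # remove embedded code and styling entirely
    return soup.get_text(separator=" ", strip=True)


sample = "<html><body><h1>Acme Corp</h1><script>trackVisit()</script><p>Industrial supplies, Austin TX.</p></body></html>"
print(html_to_text(sample))  # -> "Acme Corp Industrial supplies, Austin TX."
```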
Ensemble algorithms are critical to the prediction process; these combine predictions from several individual models (base models, or "weak learners," that are only slightly better than random guessing) to validate company information such as name, business description, sector, location and operational activity. The system also factors in any sentiment polarity around announcements disclosed on the site.
"After a site is crawled, the algorithms hit different components of the pages that were pulled, and they vote and come back with a recommendation," Hadi explained. "There is no human in the loop in this process; the algorithms are basically competing against each other. That helps increase our coverage efficiently."
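The voting idea can be shown with a simple majority vote: several weak extractors each propose a value for a field, and the most common answer wins along with a measure of how many extractors agreed. The extractor outputs below are invented, and this is only a sketch of the voting mechanic, not S&P's models.

```python
# Majority-vote ensemble over candidate values proposed by several extractors.
from collections import Counter


def ensemble_vote(candidates: list) -> tuple:
    """Return the winning value and the share of extractors that agreed on it."""
    counts = Counter(c.strip().lower() for c in candidates if c)
    value, votes = counts.most_common(1)[0]
    return value, votes / len(candidates)


# Three hypothetical miners each guess the company name from different page parts.
name, agreement = ensemble_vote(["Acme Corp", "ACME Corp", "Acme Corporation"])
print(name, round(agreement, 2))  # -> "acme corp" with 0.67 agreement
```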
After this initial load, the system monitors site activity, automatically running weekly scans. It doesn't update information every week, only when it detects a change, Hadi added. When performing subsequent scans, a hash key tracks the landing page from the previous crawl, and the system generates another key; if they are identical, no changes were made and no action is required. However, if the hash keys don't match, the system is triggered to update the company's information.
This continuous scanning is important to ensure the system remains as current as possible. "If they're updating the site often, that tells us they're alive, right?" Hadi noted.
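That change-detection step maps naturally onto a content hash. The sketch below assumes the cleaned page text is what gets hashed and that the previous hash is stored somewhere between crawls; both are assumptions for illustration.

```python
# Hash-based change detection: re-mine a site only when its content hash differs
# from the hash recorded on the previous weekly crawl.
import hashlib


def page_hash(cleaned_text: str) -> str:
    return hashlib.sha256(cleaned_text.encode("utf-8")).hexdigest()


def needs_update(previous_hash: str, current_text: str) -> bool:
    return page_hash(current_text) != previous_hash


stored = page_hash("Acme Corp Industrial supplies, Austin TX.")
print(needs_update(stored, "Acme Corp Industrial supplies, Austin TX."))             # False: no change, skip
print(needs_update(stored, "Acme Corp Industrial supplies, Austin and Dallas TX."))  # True: trigger an update
```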
Challenges with processing speed, giant datasets and unclean websites
There were challenges to overcome when building the system, of course, particularly due to the sheer size of the datasets and the need for quick processing. Hadi's team had to make trade-offs to balance accuracy and speed.
"We kept optimizing the different algorithms to run faster," he said. "And tweaking; some of the algorithms we had were really good, with high accuracy, high precision and high recall, but they were computationally too expensive."
Websites don't always conform to standard formats, which requires flexible scraping methods.
"You hear a lot about designing websites with best practices like this, because when we originally started, we thought, 'Hey, every website should conform to a sitemap or XML,'" Hadi said. "And guess what? Nobody follows that."
The team didn't want to hard-code rules or incorporate robotic process automation (RPA) into the system because websites vary so widely, Hadi said, and they knew the most important information they needed was in the text. This led to a system that pulls only the necessary components of a site, then cleanses them of actual code, stripping out any JavaScript or TypeScript.
As Hadi noted, "the biggest challenges were around performance and tuning, and the fact that websites, by design, are not clean."