Google’s Graph for Understanding Artifact Composition (GUAC) is a free, open source tool designed to bring together various sources of software security metadata into one graph database. According to Google, it helps businesses better understand their software supply chains.
Companies can draw data about their software supply chains from a variety of sources, such as software bills of materials (SBOMs), signed attestations about how software was built, and vulnerability databases that aggregate findings from multiple sources. GUAC aims to make this information freely accessible and useful to organizations around the world.
Metadata Collection
Metadata is data that identifies and describes other information and makes it easier to search and retrieve. It plays a vital role in data warehouse and business intelligence systems, which is why metadata records should share a consistent structure. Metadata typically follows a structured data model developed with methods such as entity-relationship diagramming.
Metadata should provide a high level of standardization, describe the origin and transformations of data, and assign credit for products it references. This is particularly pertinent in security contexts where it’s essential to know who created an artifact or software library and why.
Google’s open source project GUAC, which seeks to create a graph database for tracking software build, security and dependency metadata, has begun the collection phase of its metadata process. GUAC strives to democratize access to valuable software security metadata by making it available for all organizations regardless of their IT budgets or enterprise-scale security infrastructures.
To achieve its objectives, GUAC is seeking contributors from throughout the tech community. Its current contributors include engineers from Google, Kusari, Purdue University and Citi.
The GUAC team will endeavor to create a database containing software build, security and dependency metadata and make it available through the GUAC portal. This will enable IT and security leaders to better comprehend their risks so they can swiftly and confidently address incidents involving threats to their organizations.
Once GUAC is up and running, all collected data will be organized into an in-memory graph of targets and their dependencies. This graph can then be queried to return details such as the SBOM (software bill of materials), provenance (build chain), project scorecard, vulnerabilities and recent lifecycle events for an artifact and its transitive dependencies.
Queries to the graph can support higher-level organizational functions such as audit, policy, risk management and developer assistance. Furthermore, the graph provides a secure base for analyzing vulnerabilities and their blast radius, which may be necessary when assessing how much impact a vulnerability could have in a supply chain.
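To make the ideas of transitive dependencies and blast radius concrete, here is a minimal sketch in Python. It does not use GUAC itself: the artifact names, the edge list and the helper functions are all hypothetical, and the graph is a plain in-memory adjacency map.

```python
from collections import defaultdict, deque

# Hypothetical dependency edges: (artifact, direct dependency).
EDGES = [
    ("web-app", "http-lib"),
    ("web-app", "json-lib"),
    ("http-lib", "tls-lib"),
    ("billing-service", "json-lib"),
    ("json-lib", "unicode-lib"),
]

def build_graph(edges):
    """Return forward (depends-on) and reverse (used-by) adjacency maps."""
    forward, reverse = defaultdict(set), defaultdict(set)
    for artifact, dependency in edges:
        forward[artifact].add(dependency)
        reverse[dependency].add(artifact)
    return forward, reverse

def reachable(adjacency, start):
    """Breadth-first search: every node reachable from start, excluding start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen and neighbor != start:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

forward, reverse = build_graph(EDGES)

# Everything web-app pulls in, directly or indirectly.
transitive_deps = reachable(forward, "web-app")

# Everything that would be affected by a flaw in unicode-lib.
blast_radius = reachable(reverse, "unicode-lib")

print(sorted(transitive_deps))
print(sorted(blast_radius))
```

Walking the forward edges answers "what does this artifact depend on?", while walking the reverse edges answers "what is exposed if this component is vulnerable?".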
Data Ingestion
Data ingestion is the process of collecting raw data of various types and formats from multiple sources and transferring it to a central repository for analysis and processing. It’s an integral step in data operations for companies seeking valuable insights from their raw data.
Data ingestion can be done in real time or in batches, depending on your business needs. Some tools offer both options, so it’s essential to select one that best meets your requirements.
Batch Processing: Batch ingestion can be especially helpful for applications that need to ingest large amounts of data at a later stage (e.g., loading data at regular intervals or after an event).
However, reliability and cost should be taken into account when using this method of ingestion. If the data provided is unreliable or prone to errors, then this type of ingestion may not be suitable for your application.
Lambda-based Data Ingestion: A lambda architecture combines a batch layer with a real-time layer. The real-time layer suits time-sensitive data, since it ingests information in smaller pieces that can be quickly accessed when needed, while the batch layer handles larger periodic loads. This can make the overall process more efficient than relying on batch loads alone.
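The difference between loading everything at once and ingesting in smaller pieces can be sketched in a few lines of Python. This is an illustration only: the `ingest` sink, the in-memory stores and the sample events are hypothetical stand-ins for a real pipeline.

```python
from itertools import islice

def micro_batches(records, size):
    """Yield successive lists of at most `size` records from any iterable."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

def ingest(batch, store):
    """Hypothetical sink: append a batch to an in-memory store."""
    store.extend(batch)
    return len(batch)

events = [{"id": n} for n in range(7)]

# Batch ingestion: one large load, for example on a schedule.
batch_store = []
ingest(events, batch_store)

# Smaller pieces: each chunk becomes available as soon as it is ingested.
stream_store = []
batch_sizes = [ingest(chunk, stream_store) for chunk in micro_batches(events, 3)]

print(batch_sizes)
```

Both approaches end with the same data in the store; what changes is how soon each record becomes available and how much work a single failed load has to repeat.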
To ensure a successful data ingestion strategy, it’s important to comprehend the types, formats and frequency of inputted data. This will enable you to choose an appropriate tool and guarantee your process runs as smoothly as possible.
Additionally, you should take into account any security and compliance risks that could impact your business when selecting an ingestion tool. Doing this helps you prevent data losses or unauthorized access.
Data Assembly
Google’s new open source project, Graph for Understanding Artifact Composition (GUAC), collects and synthesizes software security metadata into a high-fidelity graph database. It normalizes entity identifiers and maps standard relationships between them, making it simple to explore build, security and dependency information in an organized fashion.
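Normalizing entity identifiers means mapping the different spellings of the same package, as they appear in different documents, onto one canonical key. The Python sketch below illustrates the idea; the normalization rules and the key format are assumptions made for this example, not GUAC’s actual scheme.

```python
import re

def normalize_package_id(ecosystem, name, version):
    """Build one canonical key from differently formatted identifiers.

    The rules here are illustrative assumptions, not GUAC's actual scheme.
    """
    ecosystem = ecosystem.strip().lower()
    # Collapse runs of '-', '_' and '.' into a single '-' and lowercase the name.
    name = re.sub(r"[-_.]+", "-", name.strip()).lower()
    version = version.strip().lower()
    if version.startswith("v"):
        version = version[1:]
    return f"{ecosystem}:{name}@{version}"

# The same hypothetical package as three sources might spell it.
records = [
    ("PyPI", "Example_Lib", "v1.2.0"),
    ("pypi", "example-lib", "1.2.0"),
    ("pypi", "example.lib ", "1.2.0"),
]

canonical_ids = {normalize_package_id(*record) for record in records}
print(canonical_ids)
```

Once all three records resolve to the same key, metadata from each source can be attached to a single node in the graph instead of three disconnected ones.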
GUAC collects data from multiple sources, such as software bills of materials (SBOMs), Supply-chain Levels for Software Artifacts (SLSA) provenance attestations and vulnerability databases. This provides an expansive view of these crucial data sets and helps organizations better protect themselves against attack.
Google says the technology powering GUAC is Neo4j, a graph database designed for efficient mapping of complex relationships between entities. Through this knowledge graph architecture, GUAC can ingest data from various source systems and query its engine to retrieve details about an artifact’s SBOM, provenance, build chain, project scorecard and vulnerabilities.
Historical analysis can also provide valuable insight into how a vulnerability was exploited in the past. This allows organizations to determine whether an issue was caused by one specific actor and how it may have affected their entire software supply chain ecosystem.
In addition to offering a central view of software security metadata, GUAC also serves as an instrument for organizations to track their progress with security policies and remediation initiatives. Furthermore, it could allow businesses to identify which parts of their software supply chains are most vulnerable and how best to protect them against future attacks, according to the tech giant.
By providing a single, integrated and scalable system for analyzing the security of an entire software supply chain, GUAC should make it easier for organizations to safeguard their investments. It also makes it simpler for developers, security teams and auditors to obtain information regarding the security, provenance and trustworthiness of their software artifacts.
GUAC is still in its early stages, and needs more development before becoming widely useful. Once a strong community arises, more security professionals will have access to this tool and its advantages.
Query
There’s a need for a graph database of software artifacts that can track their provenance, security and dependencies. Google is trying to address this need with its GUAC open source project, which it says will change how the industry views software supply chains.
Unlike traditional databases, GUAC uses a knowledge graph to store artifact metadata. Such systems are gaining popularity among IT management tools as cloud-native applications become increasingly distributed, ephemeral and dense, because they can efficiently map complex relationships between data sets.
To create a graph, GUAC ingests raw metadata from disparate upstream sources: open and public databases such as OSV (Open Source Vulnerabilities), first-party sources such as an organization’s internal repositories, and proprietary third-party sources such as data vendors. It then assembles this data into a coherent graph by normalizing entity identifiers, traversing the dependency tree and reifying implicit entity relationships, e.g., between project and developer, between vulnerability and software version, between artifact and source repository, and so on.
Once GUAC has the required metadata, it can be used to perform queries that will return data about the artifacts that make up the graph. For example, a user can query the graph to determine the SBOM, provenance, build chain, project scorecard and vulnerabilities for a given artifact.
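Reifying implicit relationships means turning facts that are buried inside separate documents into explicit, typed edges in one graph. The following Python sketch shows the idea with deliberately simplified inputs. The document shapes, relation names and identifiers are all hypothetical and do not reflect real SBOM or vulnerability feed formats.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    source: str
    relation: str
    target: str

# Hypothetical, simplified input documents.
SBOM = {
    "artifact": "web-app@2.0",
    "depends_on": ["json-lib@1.4"],
    "repo": "git.example.com/web-app",
}
VULN_FEED = [{"id": "VULN-0001", "affects": "json-lib@1.4"}]

def assemble(sbom, vuln_feed):
    """Turn facts from separate documents into explicit, typed edges."""
    edges = set()
    for dependency in sbom["depends_on"]:
        edges.add(Edge(sbom["artifact"], "DEPENDS_ON", dependency))
    edges.add(Edge(sbom["artifact"], "BUILT_FROM", sbom["repo"]))
    for vuln in vuln_feed:
        edges.add(Edge(vuln["id"], "AFFECTS", vuln["affects"]))
    return edges

def vulnerabilities_for(artifact, edges):
    """List vulnerabilities affecting an artifact or its direct dependencies."""
    direct = {e.target for e in edges
              if e.source == artifact and e.relation == "DEPENDS_ON"}
    candidates = direct | {artifact}
    return sorted(e.source for e in edges
                  if e.relation == "AFFECTS" and e.target in candidates)

graph = assemble(SBOM, VULN_FEED)
print(vulnerabilities_for("web-app@2.0", graph))
```

Neither input document states that the application is exposed to the vulnerability. That conclusion only becomes available once both documents are expressed as edges in the same graph. For brevity, this sketch checks direct dependencies only.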
This information is crucial for CISOs and other stakeholders in the software supply chain, as it can reveal weak points, risky dependencies and whether binaries can be traced to a securely managed repository. This will allow CISOs to identify these vulnerabilities and find ways to prevent compromises.
While a number of tools and services have been created to aggregate and synthesize software security metadata, GUAC is unique in that it provides a graph database for storage, analysis and visualization of this data. The graph database, built on Neo4j and accessed through a GraphQL API, allows users to access metadata that has been stored remotely or locally.
Using GUAC, a user can easily determine the most frequently used critical components in the software supply chain, weak points and risky dependencies. They can also use the graph database to analyze their current software supply chain security posture and improve it over time.
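As a rough illustration of how a dependency graph can surface the most widely used components, the Python sketch below counts how many applications depend on each component, directly or transitively. The dependency map and application names are hypothetical, and the code does not call GUAC.

```python
from collections import Counter

# Hypothetical map of each artifact to its direct dependencies.
DEPENDENCIES = {
    "web-app": ["http-lib", "json-lib"],
    "billing-service": ["json-lib"],
    "report-tool": ["json-lib", "pdf-lib"],
    "http-lib": ["tls-lib"],
    "json-lib": [],
    "pdf-lib": [],
    "tls-lib": [],
}
APPLICATIONS = ["web-app", "billing-service", "report-tool"]

def transitive_dependencies(artifact, dependencies):
    """Depth-first walk collecting every direct and indirect dependency."""
    seen, stack = set(), list(dependencies.get(artifact, []))
    while stack:
        dependency = stack.pop()
        if dependency not in seen:
            seen.add(dependency)
            stack.extend(dependencies.get(dependency, []))
    return seen

# Count, for each component, how many applications rely on it.
usage = Counter()
for application in APPLICATIONS:
    usage.update(transitive_dependencies(application, DEPENDENCIES))

most_used = usage.most_common(1)
print(most_used)
```

A component that many applications rely on is a natural candidate for closer review, since a flaw in it would reach the largest part of the supply chain.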
