Google Leaks: How Does Google Work?
These are my speculations on how Google works based on Google leaks data as well as other available documents and sources (work in progress).
Last Updated: 13.06.2024
Status: work in progress.
Annotation: the "Google leaks" are ~14k attributes (features) within the Google Content Warehouse API that leaked to the web on May 7, 2024, and were widely reported on May 27. These attributes are given in no particular order, with little indication of their current validity or context of usage. The leak matters because, for the first time, we have a list of attributes that may take part in ranking, called by their internal names. Despite the lack of weights attached to them, and the obvious loss of context (their connections to one another), we can see which signals Google may use for ranking and compare them to other known information: Google employees' testimony in the ongoing antitrust court proceedings (DOJ antitrust testimony), Google patents, and data from relevant experiments.
High-level Summary*
*All presumptions are my own (speculative); I can't guarantee their validity.
Google uses two ranking frameworks: a primary ranking system and a re-ranking framework. The primary ranking algorithm is called AScorer and is part of Mustang, the primary scoring, ranking, and serving system. On top of AScorer sit re-ranking adjustments collectively called Twiddlers. The Twiddler framework supports hundreds of individual twiddlers. Each twiddler operates in isolation, without knowledge of the decisions made by the others; their suggestions are reconciled at the framework level. Twiddlers are plugged into SuperRoot, the brain of Google Search that sends messages to Google's servers, i.e. they act directly on the search results.
Google uses three tiers of repositories for documents [intended for ranking]: the 1st tier holds fresh and often-retrieved documents; the 2nd holds documents that are updated, and therefore retrieved, rarely (the system that places documents into tiers within the index is called SegIndexer); and the 3rd holds documents that live on disk long term (the secondary indexing system for these is called TeraGoogle).
The pipeline comprises the following systems: crawling, indexing, rendering, processing, ranking, and serving. In ranking, SuperRoot sends a request to thousands of servers simultaneously; each one individually performs its part of the ranking function, and their results are "joined" and weighed before serving. The whole process takes 0.25 seconds on average.
Primary scoring and ranking happens at a per-document level but includes site-level data. Scored per document are page quality data, spam(ness), semantic data (discovered entities), freshness of the content, PageRank, NSR (Neural Semantic Retrieval) data, YMYL (Your Money or Your Life, i.e. sensitive content) scores, number of URLs, content length, etc. Mustang and TeraGoogle also include messages containing compressed per-doc [quality] signals (called compressed quality signals). Among them are indications to Promote and Demote documents, including what we knew as earlier "Google Updates": Panda (as well as Baby Panda), Product Review Promote/Demote, etc., which might read as "included in the core algorithm".
Per-document data refer to the Quality component (aka Q* or QStar): a group of web page quality signals, with an experimental part made in order to quickly run LEs (live experiments) with new site components. It uses scores learned via the NSR (Neural Semantic Retrieval) method to better understand the "page rather than just keywords", earlier called a critical part of document retrieval. Q* contains scores on Low Quality, Site Authority, Scamness, Topic Embeddings, etc., as well as an NSR override bid used in Q* for emergency overrides. So Q* relates web page quality scores based on understanding the content, organised as an RLE (run-length encoded) sequence.
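If the compressed quality signals are indeed stored as an RLE sequence, the idea can be sketched as follows. This is a minimal illustration of run-length encoding only; the actual encoding, field layout and value ranges are unknown.

```python
# Minimal run-length encoding (RLE) sketch: compress a sequence of small
# integer signal values as (value, run_length) pairs. Purely illustrative.

def rle_encode(signals):
    """Compress a sequence of integer signal values into (value, run) pairs."""
    encoded = []
    for value in signals:
        if encoded and encoded[-1][0] == value:
            # Same value as the previous run: extend the run.
            encoded[-1] = (value, encoded[-1][1] + 1)
        else:
            # New value: start a fresh run of length 1.
            encoded.append((value, 1))
    return encoded

def rle_decode(encoded):
    """Expand (value, run) pairs back into the original sequence."""
    out = []
    for value, run in encoded:
        out.extend([value] * run)
    return out
```

RLE is attractive for quality signals because many documents share long runs of identical (often zero) values, so the compressed form is much shorter than the raw sequence.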
Embeddings, i.e. high-dimensional (64 and more dimensions) vectors, are used to compress and encode page and site semantic data. Topic Embeddings, e.g. the closeness [of a document or a whole site] to a topic as expressed in a search query, are part of the Compressed Quality Signals and are also used to detect and fight spam.
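A common way to measure "closeness" between embedding vectors is cosine similarity; a minimal sketch follows. How Google actually computes topic closeness is not disclosed in the leak, so this is only the standard textbook measure.

```python
import math

def cosine_similarity(a, b):
    """Closeness of two embedding vectors: 1.0 means identical direction,
    0.0 means orthogonal (unrelated) directions."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

With such a measure, a document embedding can be compared against a topic (or query) embedding, and a score near 1.0 would indicate that the document stays close to the topic.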
Overall, Google arguably uses NSR heavily at the indexing stage too. At this stage NSR is used to define different aspects of quality at the chunk [of document], document, cluster [of documents] and site level, etc., including scoring with the following content quality predictors: the Keto score (arguably, the content's "meat" or substance), the Chard score (arguably, an alternative source of "protein"), and the "ToFu" score. It also scores whether or not the title matches user queries, calculates and stores an aggregated value of the "link out" scores of site pages, a locality score [as derived by means of NSR], and a Clutter Score penalizing sites with a large number of distracting/annoying resources loaded by the site. Finally, NSR is responsible for the Site Authority value, a Default NSR (for new pages) and a Confidence level (the level of confidence in the NSR data calculations). These scores are then used at the stage of joining the document's data versions, and so forth.
Twiddlers are responsible for re-ranking results from a single corpus. They act on a ranked sequence of results rather than on individual results. Twiddlers may work on a device basis, location basis, topic basis, etc. Google has Boost (or Demote) functions that are part of the Twiddler framework. Boost functions identified in the docs: NavBoost, QualityBoost, RealTimeBoost, WebImageBoost and more.
NavBoost is a re-ranking module that uses a pair-of-dice metaphor and logic. Presumably, the module uses clicks and impressions (and their proportions) as a "winning" dice combination per specific position in the SERP: if a document gets a better combo for a position than the others, it gets a boost. People who navigate the search and choose a specific document are called "voters", and the whole process "voting"; people's data are tokenised and stored. This Twiddler works to Boost (promote) or Demote sites.
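The described logic, clicks and impressions per SERP position deciding a boost or demotion, can be sketched roughly like this. All names, the baseline comparison and the threshold are my own illustrative assumptions, not leaked values.

```python
# Hypothetical sketch of NavBoost-style logic: compare a result's observed
# click-through rate against the expected (baseline) CTR for its SERP
# position, and boost or demote accordingly. Threshold is invented.

def navboost_adjustment(clicks, impressions, baseline_ctr, threshold=0.2):
    """Return +1 (boost), -1 (demote) or 0 (leave as-is) for one result."""
    if impressions == 0:
        return 0  # no user data yet, nothing to decide on
    ctr = clicks / impressions
    if ctr > baseline_ctr * (1 + threshold):
        return 1   # "winning combo": better than expected for this position
    if ctr < baseline_ctr * (1 - threshold):
        return -1  # underperforming for this position
    return 0
```

The key idea this illustrates is that success is relative to position: a 30% CTR is excellent at position 5 but may be poor at position 1.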
Signals and data that relate to links and link anchors are numerous, which reflects Google's immense capacity to detect and fight link spam (a reference to the SpamBrain component). Findings suggest there is no principal distinction between internal (local) links and external ones. Google weighs link quality by indexing tier, giving priority to "newsy" (new and classy) documents, and infers document source quality from the indexing tier itself. Arguably, Google uses a trusted source database (seed data) and compares link data to it [where relevant]. PageRank arguably exists in several variants, including one that implies a "distance" from gold-standard "seed" sites.
General recommendations
Given Google's ability to understand content (ngrams = keywords, NSR, Embeddings, anchor signals), it's important to adhere to:
Maximum relevancy of the page content to the page title as well as to the implied (and targeted) user intent; page titles are still important too: they must contain keyword(s), an indication of intent, and site info (brand-related).
Semantic-rich content, including: text quality, quality of the sites linked to (sources), mentions of expert, closely related and recognised entities.
Content freshness: regularly and significantly update content, and keep the date on the page too.
Produce geo-relevant content and documents that have high "locality" (where relevant).
Also, preventing spam shall be a top priority:
Link audits to detect possible link spam,
Zero tolerance for gibberish content and scaled content spam, e.g. creating many pages whose content makes little or no sense to a reader but contains search keywords,
Gaining links from "newsy" sources that are constantly updated (news sites, etc.): Digital PR.
Recommendations specific to the discovered ranking signals (the referenced attributes are described in detail below in Groups of Modules or Terms):
Pay attention to result/query matching, i.e. whether pages match the perceived needs (intent) of the users [who issue the query], as it may boost the site in rankings; if not, update or remove content (referencing SiteBoost).
Spot and investigate emerging trends in clicks, impressions and CTR per pattern of pages: these may predict an imminent change in traffic (boosts/demotions); try to update content with regard to the Top-10 competitors' pages (referencing the NavBoost component).
Spot and promote internally the content that shows great user metrics (clicks to impressions, CTR) and/or great quality, as it may result in a ranking boost (referencing the Firefly component).
Spot and investigate unusual relations between links, clicks and keywords: mind a high number of links versus a low number of keywords and traffic (referencing Panda).
Optimise Core Web Vitals (referencing VOLT).
Keep the site's content within the core focus of the site; avoid topics that stretch too far from that focus (referencing Embeddings).
Mind site authority: build topical authority (topical in-depth content), select reputable sources to link out to, build citations within the topic (inbound links), and mind semantic data (entities).
Spot decayed/old content by looking at a constantly decreasing number of clicks and CTR: these may predict a future downtrend in clicks for a whole pattern of pages; significantly update such pages to avoid this (referencing Last Good Click).
Named Systems
Crawling
Trawler – The web crawling system. It features a crawl queue, maintains crawl rates, and understands how often pages change.
Indexing
Alexandria – The core indexing system.
SegIndexer – System that places documents into tiers within the index.
TeraGoogle – Secondary indexing system for documents that live on disk long term.
Rendering
HtmlrenderWebkitHeadless – Rendering system for JavaScript pages. Oddly, this is named after WebKit rather than Chromium. There is mention of Chromium in the docs, so it's likely that Google originally used WebKit and made the switch once Headless Chrome arrived.
Processing
LinkExtractor – Extracts links from pages.
WebMirror – System for managing canonicalization and duplication.
Ranking
Mustang – The primary scoring, ranking, and serving system.
Ascorer – The primary ranking algorithm that ranks pages prior to any re-ranking adjustments.
NavBoost – Re-ranking system based on click logs of user behavior.
FreshnessTwiddler – Re-ranking system for documents based on freshness.
WebChooserScorer – Defines feature names used in snippet scoring.
Serving
Google Web Server – GWS is the server that the frontend of Google interacts with. It receives the payloads of data to display to the user.
SuperRoot – The brain of Google Search; it sends messages to Google's servers and manages the post-processing system for re-ranking and presentation of results.
SnippetBrain – The system that generates snippets for results.
Glue – The system for pulling together universal results using user behavior.
Cookbook – System for generating signals. There is an indication that values are created at runtime.
Groups of Modules
Here follow descriptions of the groups of modules found, as well as their entries (individual modules).
Twiddlers
Twiddlers (general) are [multiple] re-ranking functions applied at the SuperRoot level. There are PRE_DOC and POST_DOC twiddlers; PRE_DOC twiddlers fire before the DocInfo request is sent to the Mustang backend. Arguably, they have a major impact on Google [and YouTube] search. Twiddlers are responsible for re-ranking results from a single corpus and act on a ranked sequence of results rather than on individual results. Number of twiddlers: unknown, probably in the hundreds now (64 in 2018).
Types of Twiddlers:
Device: boost Mac software results (.dmg files) for iOS users [specifically].
Location: if your physical location is in an area affected by a natural catastrophe (e.g., wildfires), twiddlers could be responsible for boosting news results.
Topic: certain topics and misinformation could be affected by twiddlers; e.g., content of a certain party or viewpoint could be promoted/demoted (example: Covid-19 topics).
Social graph: used in one of SuperRoot's PRE_DOC twiddlers as a lookup key for the full Social Search data (PRE_DOC = twiddlers firing before the DocInfo request is sent to the Mustang backend).
Freshness (RealTimeBoost): used by the FreshnessTwiddler to reward fresh content.
Navigation: see NavBoost below.
Quality: different quality, domain and anchor based signals.
SiteBoost: used to promote/demote sites that do best at matching results to queries, etc.
Social search: to promote best satisfying social search results.
Sandbox: used to sandbox fresh [domain] spam.
WebImageBoost: used to boost/demote images, e.g. based on SafeSearch signals.
The Twiddler framework - principles:
Isolation: Unlike Ascorer, which has a few complex algorithms, the Twiddler framework supports hundreds of individual twiddlers. Each of these twiddlers operates in isolation, without knowledge of the decisions made by others.
Interaction Resolution: Since twiddlers operate independently, they provide recommendations and constraints on ranking changes. The framework then reconciles these suggestions.
Context Provision: The framework offers safe, read-only access to the context in which results are being twiddled.
Ease of Experimentation: Due to their integration within Superroot, it's often simpler to run ranking experiments using a twiddler.
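The principles above can be sketched as a toy pipeline: each twiddler independently proposes score deltas over a read-only result list, and the framework reconciles them before re-sorting. Everything here, including the twiddler names and delta values, is an illustrative assumption, not the actual internal API.

```python
# Toy Twiddler framework: twiddlers run in isolation and return score
# deltas; the framework sums (reconciles) them and re-sorts the results.

def freshness_twiddler(results):
    """Reward fresh documents (illustrative RealTimeBoost analogue)."""
    return {r["url"]: 2.0 for r in results if r.get("fresh")}

def spam_twiddler(results):
    """Penalise documents flagged as spam (illustrative)."""
    return {r["url"]: -5.0 for r in results if r.get("spam")}

def rerank(results, twiddlers):
    """Apply all twiddlers to a ranked list and return the re-ranked list."""
    deltas = {}
    for twiddler in twiddlers:
        # Isolation: each twiddler sees only the original (read-only) list,
        # never another twiddler's output.
        for url, delta in twiddler(results).items():
            # Interaction resolution: the framework combines the suggestions.
            deltas[url] = deltas.get(url, 0.0) + delta
    return sorted(results,
                  key=lambda r: r["score"] + deltas.get(r["url"], 0.0),
                  reverse=True)
```

This mirrors the stated design: complex base scoring lives in Ascorer, while many small, independent adjustments live in Superroot, which makes experiments cheap to add and remove.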
Quality Navboost (aka NavBoost)
A boost in ranking position as a result of user navigation data (clicks and impressions). Part of the Twiddler framework. Found in 64 entries, in forms such as CrapsCrapsData, CrapsCrapsDevice and CrapsCrapsClickSignals, as well as CrapsCrapsFeature and CrapsStatsWithWeightsProto. Original names: QualityNavboostCrapsCrapsClickSignals, QualityNavboostCrapsCrapsDevice, QualityNavboostCrapsCrapsData.
CRAPS arguably stands for "Clicks and Results Prediction System". Learn more under CRAPS in the Vocabulary terms.
Concerns web search and image search. Supposedly, the data modules are used to contextualise the click signals, whereas the stats include statistical information as well as LO (learning outcomes).
As confirmed in DOJ antitrust testimonies, etc., NavBoost signifies, depending on the context: 1) the number of clicks on a page, and/or 2) a re-ranking system that uses click counts (bad clicks, good clicks, clicks, last longest clicks, and site-wide impressions) of search queries (called reference queries), [including] per search position in Google, to Boost/Demote sites in the ranking. Implications: clicks as a measurement of a page's success suggest that pages with a low number of clicks (versus the site's average) have to be removed from the site; the same concerns impressions. Ranking signal: yes. Power: high or highest.
Signals
Good Clicks
A certain proportion of Clicks to Impressions (arguably, [high] CTR). Implications: shows user willingness to click the specific document (amidst the others). Also: Absolute impressions (unsquashed), Impressions, Clicks, Bad Clicks, etc.
Last Good Click
Indicates the date when this document received its last good click. Implications: this may suggest that content decay (or traffic loss over time) is also a function of a ranking page not driving the expected amount of clicks for its SERP position. Related: Last Longest Click (the number of clicks that were last and longest in related user queries), Unicorn Clicks (the subset of clicks that are associated with an event from a Unicorn user).
Data
Country, Device, Features: contains CrapsClickSignals for specific features (i.e. for mobile, US, metro id "123"), language.
Level of pattern (patternLevel)
The level of pattern at which NavBoost operates; higher-level (general) patterns (sub-domain, folders, etc.) get higher values [in the scoring]. Implications: suggests that clicks as a measurement of success are applied to patterns (groups of pages), which informs those groups' perceived value to users [collectively]. Also includes URL and query.
Voter [tokens]
The documentation may represent users as voters, and their clicks are stored as their votes.
Compressed Quality Signals
Messages containing per-doc [quality] signals that are compressed and included per-doc in Mustang and TeraGoogle (for per-doc data see below). May read as "included in the core algorithm". Includes indications to Promote and Demote [outside of the NavBoost component]. Found 41 attributes, including:
Indications to Promote and Demote [outside of NavBoost component]
productReview Promote/Demote Page (Site): product review demotion/promotion confidences. Possibly, Google stores here the signals from the Product Review functions [what's known as the Product Review Update]. Also includes: productReviewPReviewPage (the possibility of a page being a review page), productReviewPUhqPage (the possibility of a page being a high-quality review page).
exactMatchDomainDemotion
pandaDemotion
babyPandaDemotion: new BabyPanda demotion, applied on top of Panda.
authorityPromotion: authority promotion, converted from QualityBoost.authority.boost.
anchorMismatchDemotion: demotion for a mismatch between the link anchor and the destination site, converted from QualityBoost.mismatched.boost. Possibly, a way to fight domain [acquisition and] redirection spam.
Indications to Promote and Demote from NavBoost component
navDemotion (converted from QualityBoost.nav_demoted.boost)
crapsNewUrlSignals: contains craps signals for new URLs / patterns of URLs.
crapsAbsoluteHostSignals - Impressions, unsquashed, host level, not to be used with compressed ratios. Not to be used in Pattern Data.
NSR-related functions:
nsrConfidence
Includes references to the experimental Q* component, [created] in order to quickly run LEs (live experiments) with new site components. Includes:
lowQuality (learned through NSR, applied in Q star component).
siteAuthority - site_authority: converted from quality_nsr.SiteAuthority, applied in Qstar.
nsrOverrideBid - NSR override bid, used in Q* for emergency overrides.
serpDemotion - serp demotion: applied in Qstar.
experimentalQstarSignal
experimental_nsr_team_wsj_data field: The purpose of this field is to be read by an experimental Q* component, in order to quickly run LEs with new components. See go/0DayLEs for details.
scamness: scam model score. Used as one of the web page quality qstar signals. Value range from 0 to 1023.
topicEmbeddingsVersionedData - Versioned TopicEmbeddings data to be populated later into superroot / used directly in scorers.
unauthoritativeScore - Unauthoritative score. Used as one of the web page quality qstar signals.
Page Quality Signals
pqData (type: integer(), default: nil) - Encoded page-level PQ signals.
Per-Doc Data
PerDocData: this group of attributes is used for building the per-doc data, in both Mustang and TeraGoogle, that is needed during serving. It relates to different types of documents: media (images and video), site pages. Arguably, it is the basic repository of data used for serving search results. It may be classified as follows:
Compressed Quality Signals
compressedQualitySignals (type: GoogleApi.ContentWarehouse.V1.Model.CompressedQualitySignals.t, default: nil)
Spam
uacSpamScore - The uac spam score is represented in 7 bits, going from 0 to 127. Threshold is 64. Score >= 64 is considered as uac spam. Presumably, helps detect "shell" spam.
pspamtokensContentScore - For SpamTokens content scores. Used in SiteBoostTwiddler to determine whether a page is UGC Spam. Presumably, helps detect UGC spam [comments on page, etc].
spamMuppetSignals - Contains hacked site signals which will be used in query time joins.
trendspamScore - For now, the count of matching trendspam queries.
ScaledSpamScoreYoram - Spamscores are represented as a 7-bit integer, going from 0 to 127.
spambrainData - Site level scores coming from spambrain.
DocLevelSpamScore (type: integer(), default: nil) - The document spam score is represented as a 7-bits, going from 0 to 127.
extraData (type: GoogleApi.ContentWarehouse.V1.Model.Proto2BridgeMessageSet.t, default: nil) - Currently this field contains:
UrlSignals for the document level spam classifier (when the doclevelspamscore is set).
PerDocLangidData and
realtimespam::ClassifierResult for the document level fresh spam classifier (when the doc-level fresh spam score is generated).
MicroblogDocQualitySignals for the document-level microblog spam classifier. This only exists in Firebird for now.
spam_buckets::BucketsData for a document-structure hash.
urlPoisoningData (type: GoogleApi.ContentWarehouse.V1.Model.UrlPoisoningData.t, default: nil) - Contains url poisoning data for suppressing spam documents.
KeywordStuffingScore (type: integer(), default: nil) - The keyword stuffing score is represented in 7 bits, going from 0 to 127.
spambrainTotalDocSpamScore (type: number(), default: nil) - The document total spam score identified by spambrain, going from 0 to 1.
spamrank: The spamrank measures the likelihood that this document links to known spammers. Its value is between 0 and 65535.
gibberishScore: The gibberish score is represented in 7 bits, going from 0 to 127.
IsAnchorBayesSpam (type: boolean(), default: nil) - Is this document considered spam by the anchor bayes classifier?
SpamWordScore (type: integer(), default: nil) - The spamword score is represented in 7-bits, going from 0 to 127.
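Several of the scores above share the same 7-bit, 0-127 representation, with a documented threshold of 64 for uacSpamScore. A sketch of how such quantisation might work follows; the mapping from a raw 0.0-1.0 model output is my assumption, only the 0-127 range and the 64 threshold come from the docs.

```python
# Quantising a raw model score into the 7-bit (0-127) range used by many
# of the leaked spam attributes. The raw-score mapping is an assumption.

def pack_spam_score(raw, bits=7):
    """Clamp and quantise a 0.0-1.0 model score into the 0-127 range."""
    max_val = (1 << bits) - 1  # 127 for 7 bits
    return max(0, min(max_val, round(raw * max_val)))

def is_uac_spam(score, threshold=64):
    """uacSpamScore rule from the docs: a score >= 64 counts as uac spam."""
    return score >= threshold
```

Storing scores as 7-bit integers keeps per-doc data compact while preserving enough resolution for threshold-based decisions.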
Semantic data
webrefEntities – WebRef entities associated with the document. Relates to the Google patent "Systems and methods for selecting content using webref entities": https://patentimages.storage.googleapis.com/99/16/b2/66c212fad241fb/WO2014008654A1.pdf. Re: entities, semantic search.
semanticDateinfo
nsrDataProto (type: GoogleApi.ContentWarehouse.V1.Model.QualityNsrNsrData.t, default: nil) - Stripped site-level signals, not present in the explicit nsr_* fields, nor compressed_quality_signals.
mediaOrPeopleEntities (type: GoogleApi.ContentWarehouse.V1.Model.ImageQualitySensitiveMediaOrPeopleEntities.t, default: nil) - Contains the mids of the 5 most topical entities annotated with selected KG collections. This information is currently used on Image Search to detect cases where results converged to mostly a single person or media entity. More details: go/result-set-convergence.
Freshness
datesInfo (type: String.t, default: nil) - Stores dates-related info (e.g. page is old based on its date annotations). Used in FreshnessTwiddler. Use encode/decode functions from quality/timebased/utils/dates-info-helper-inl.h
semanticDateConfidence (type: integer(), default: nil) - DEPRECATED: semantic_date_confidence replaced by semantic_date_info.
lastSignificantUpdate - Last significant update of the document. This is sourced from the quality_timebased.LastSignificantUpdate proto as computed by the LSUSelector from various signals. The value is a UNIX timestamp in seconds.
freshnessEncodedSignals (type: String.t, default: nil) - Stores freshness and aging related data, such as time-related quality metrics predicted from url-pattern level signals. Use the encoding decoding API in quality/freshness/docclassifier/aging/encoded-pattern-signals.h This field is deprecated.
freshboxArticleScores (type: integer(), default: nil) - Stores scores of freshness-related classifiers: freshbox article score, live blog score and host-level article score. The encoding/decoding API is in quality/freshness/freshbox/goldmine/freshbox_annotation_encoder.h. To use this field, you MUST join g/pq-classifiers-announce and add your use case at http://shortn/_RYXS2lX2IV.
isHotdoc (type: boolean(), default: nil) - Set by the FreshDocs instant doc joiner. See //indexing/instant/hotdocs/README and http://go/freshdocs-hotdocs.
timeSensitivity (type: integer(), default: nil) - Encoded Document Time Sensitivity signal.
PageRank
pagerank (type: number(), default: nil) - Experimental pageranks (DEPRECATED; only pagerank in MustangBasicInfo is used).
pagerank1, pagerank2
homepagePagerankNs (NS = Nearest Seeds) – The PageRank of the homepage of the site. Copied from the cdoc.doc().pagerank_ns() of the homepage.
crawlPagerank (type: integer(), default: nil) - This field is used internally by the docjoiner to forward the crawl pageranks from original canonicals to canonicals we actually chose; outside sources should not set it, and it should not be present in actual docjoins or the index.
toolbarPagerank (type: integer(), default: nil) - A copy of the value stored in /namespace/indexing/wwwglobal//fakepr/* for this document. A value of quality_bakery::FakeprUtils::kUnknownToolbarPagerank indicates that we don't have toolbar pagerank for this document. A value between 0 and 10 (inclusive) means that this is the toolbar pagerank of the page. Finally, if this value is not set it means that the toolbar pagerank is equivalent to: quality_bakery::FakeprUtils::EstimatePreDemotionFromPagerankNearestSeeds( basic_info.pagerank_ns()) called on the MustangBasicInfo attachment for the same document.
hostNsr (type: integer(), default: nil) - Site rank computed for host-level sitechunks. This value encodes nsr, site_pr and new_nsr. See quality_nsr::util::ConvertNsrDataToHostNsr and go/nsr.
YMYL
ymylHealthScore (type: integer(), default: nil) - Stores scores of ymyl health classifier as defined at go/ymyl-classifier-dd. To use this field, you MUST join g/pq-classifiers-announce and add your use case at http://shortn/_nfg9oAldou.
ymylNewsScore (type: integer(), default: nil) - Stores scores of ymyl news classifier as defined at go/ymyl-classifier-dd. To use this field, you MUST join g/pq-classifiers-announce and add your use case at http://shortn/_nfg9oAldou.
Domain data
domainAge (type: integer(), default: nil) - 16-bit
numUrls (type: integer(), default: nil) - Total number of urls encoded in the url section = # of alternate urls + 1
hostAge (type: integer(), default: nil) - The earliest firstseen date of all pages in this host/domain. These data are used in twiddler to sandbox fresh spam in serving time. It is 16 bit and the time is day number after 2005-12-31, and all the previous time are set to 0.
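The hostAge encoding is described concretely enough to sketch: a day number after 2005-12-31, with earlier dates set to 0 and the value held in 16 bits.

```python
from datetime import date

# hostAge encoding as described in the docs: day number after 2005-12-31,
# earlier dates clamped to 0, stored as a 16-bit value.

EPOCH = date(2005, 12, 31)

def encode_host_age(first_seen):
    """Encode the earliest firstseen date of a host as a hostAge value."""
    days = (first_seen - EPOCH).days
    return max(0, min(days, (1 << 16) - 1))
```

A 16-bit day counter from this epoch covers roughly 179 years, so the clamp at 65535 is a formality; the interesting part is that anything first seen before 2006 simply reads as age 0.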
Geo
pageregions (type: String.t, default: nil) - String that encodes the position ranges for different regions of the document. See "indexer/pageregion.h" for an explanation, and how to decode the string
brainloc (type: GoogleApi.ContentWarehouse.V1.Model.QualityGeoBrainlocBrainlocAttachment.t, default: nil) - Brainloc contains location information for the document. See ariane/273189 for details.
Content number
bodyWordsToTokensRatioTotal (type: number(), default: nil) -
titleHardTokenCountWithoutStopwords (type: integer(), default: nil) - Number of hard tokens originally in title without counting the stopwords.
Content Categorisation
topPetacatTaxId (type: integer(), default: nil) - Top petacat of the site. Used in SiteboostTwiddler to determine result/query matching. Arguably, Peta refers to "Petals".
scaledSelectionTierRank (type: integer(), default: nil) - Selection tier rank is a language normalized score ranging from 0-32767 over the serving tier (Base, Zeppelins, Landfills) for this document. This is converted back to fractional position within the index tier by scaled_selection_tier_rank/32767.
queriesForWhichOfficial (type: GoogleApi.ContentWarehouse.V1.Model.OfficialPagesQuerySet.t, default: nil) - The set of (query, country, language) triples for which this document is considered to be the official page. For example, www.britneyspears.com would be official for ("britney spears", "us", 0) and others (0 is English).
onsiteProminence (type: integer(), default: nil) - Onsite prominence measures the importance of the document within its site. It is computed by propagating simulated traffic from the homepage and high craps click pages. It is a 13-bit int.
commercialScore (type: number(), default: nil) - A measure of commerciality of the document Score > 0 indicates document is commercial (i.e. sells something) Computed by repository/pageclassifiers/parsehandler-commercial.cc
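The scaledSelectionTierRank conversion above is given explicitly in the docs and is trivial to reproduce:

```python
# Convert the language-normalized 0-32767 scaledSelectionTierRank back to a
# fractional position within the serving tier, per the formula in the docs.

def fractional_tier_position(scaled_selection_tier_rank):
    """Return the fractional position (0.0-1.0) within the index tier."""
    return scaled_selection_tier_rank / 32767
```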
Other
asteroidBeltIntents (type: GoogleApi.ContentWarehouse.V1.Model.QualityOrbitAsteroidBeltDocumentIntentScores.t, default: nil) - For indexing Asteroid Belt intent scores. See go/asteroid-belt for details.
PremiumData (type: GoogleApi.ContentWarehouse.V1.Model.PremiumPerDocData.t, default: nil) - Additional metadata for Premium document in the Google index.
appsLink (type: GoogleApi.ContentWarehouse.V1.Model.QualityCalypsoAppsLink.t, default: nil) - AppsLink contains Android application IDs in outlinks. It is used to improve results ranking within applications universal. See http://go/apps-universal for the project details.
nsrSitechunk (type: String.t, default: nil) - SiteChunk computed for nsr. In some cases it can use more information than just the url (e.g. youtube channels). See NsrAnnotator for details. If the sitechunk is longer than --populate_nsr_sitechunk_max_length (default=100), it will not get populated. This field might be compressed and needs to be decoded with quality_nsr::util::DecodeNsrSitechunk. See go/nsr-chunks for more details. This field contains only nontrivial primary chunks.
ScaledIndyRank (type: integer(), default: nil) - The independence rank is represented as a 16-bit integer, which is multiplied by (max_indy_rank / 65536) to produce actual independence rank values. max_indy_rank is typically 0.84.
fireflySiteSignal (type: GoogleApi.ContentWarehouse.V1.Model.QualityCopiaFireflySiteSignal.t, default: nil) - Contains Site signal information for Firefly ranking change. See http://ariane/313938 for more details.
voltData (type: GoogleApi.ContentWarehouse.V1.Model.IndexingMobileVoltVoltPerDocData.t, default: nil) - Contains page UX signals for VOLT ranking change. See http://ariane/4025970 for more details.
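The ScaledIndyRank decoding formula above is likewise stated explicitly: a 16-bit integer multiplied by (max_indy_rank / 65536), with max_indy_rank typically 0.84.

```python
# Decode ScaledIndyRank per the formula given in the docs.

def decode_indy_rank(scaled, max_indy_rank=0.84):
    """Map a 16-bit stored value back to an independence-rank value."""
    return scaled * (max_indy_rank / 65536)
```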
Embeddings
Embedding models are used to convert documents into numerical representations in the form of multi-dimensional vectors. In machine learning, an embedding is a hidden layer with a certain number of units corresponding to the number of dimensions. Google can use compressed page (pageEmbeddings) or site (siteEmbeddings) data for this. This is how it is capable of capturing semantic similarity between documents, evaluating website quality, detecting spam, etc. Implications: used in scoring documents' Quality and Authority, including topic relevancy. The attribute TopicEmbeddings is part of the Compressed Quality Signals.
pageEmbeddings are evidently used for similarity measurement and content clustering.
siteEmbeddings (a representation of the site's overall content): for quality assessment and semantic focus:
Site Focus Score: Number denoting how much a site is focused on one topic. Implications: might be used in scoring site's Authority and Fighting spam.
Site Radius: The measure of how far page_embeddings deviate from the site_embedding. Implications: might be used in Fighting spam (documents with unrelated topics).
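Site Focus and Site Radius can be illustrated with a toy computation: take the centroid of the page embeddings as a stand-in for the site embedding, then measure how far pages deviate from it. The actual definitions and aggregation used by Google are not in the leak; this only shows the geometric intuition.

```python
import math

# Toy illustration of siteEmbeddings / Site Radius: centroid of page
# embeddings as the site vector, average deviation as the "radius".

def site_embedding(page_embeddings):
    """Centroid of the page embeddings (stand-in for siteEmbeddings)."""
    n = len(page_embeddings)
    dim = len(page_embeddings[0])
    return [sum(p[i] for p in page_embeddings) / n for i in range(dim)]

def site_radius(page_embeddings, site_vec):
    """Average Euclidean distance of pages from the site vector:
    small radius = tightly focused site, large radius = scattered topics."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return sum(dist(p, site_vec) for p in page_embeddings) / len(page_embeddings)
```

Under this picture, a site whose pages all sit near one topic vector has a near-zero radius, while a site mixing unrelated topics has a large one, which is exactly the spam signal the leak hints at.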
Potential use cases: content recommendations, spam detection, version control and temporal analysis, e.g. monitoring how a site's content and quality evolve. The versionId in the QualityAuthorityTopicEmbeddingsVersionedItem module is an example of how versions are tracked for embeddings.
Related terms: logits, logit layer, softmax, regression analysis.
Nlp Saft Document
SAFT may refer to Structured Annotation Framework and Toolkit. SAFT is designed to perform advanced semantic parsing, annotation, and extraction of entities and their relationships from textual content. NlpSaftDocument includes references to the following functions:
Entity Identification
NlpSaftEntity: Identifies named entities in the document, such as persons (PER), organizations (ORG), and locations (LOC). Stores attributes like entityType, entityTypeProbability, gender, and name.
Entity Relations
NlpSaftRelation: Defines relations between entities in the document, such as relationships between people or connections between organizations. Includes fields like source, target, type, and score to describe these relations.
Semantic Nodes and Graphs
NlpSaftSemanticNode: Represents semantic constructions in the document, forming a directed acyclic graph (DAG) that captures complex relationships and higher-level abstractions. Connects nodes to entities, measures, and token spans, providing a rich semantic structure.
Entity Profiling
NlpSaftEntityProfile: Contains detailed information about a single unique entity, such as canonical names, attributes, and embeddings. Includes embedding vectors, attributes, and disambiguation information for precise entity representation.
Volt
Volt data (IndexingMobileVoltVoltPerDocData) contains page UX signals for the VOLT ranking change. The following data are stored:
cls, fid, lcp: Core Web Vitals metrics (Cumulative Layout Shift, First Input Delay, Largest Contentful Paint).
desktopCwv, mobileCwv: Core Web Vitals metrics for desktop and mobile, respectively.
displayUrlIsHttps: Whether the displayed URL is HTTPS, a security factor that can positively affect ranking.
The VOLT ranking change might be intended to reward (or demote) documents based on their page-experience performance.
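For illustration, here is a tiny check of whether a page's stored CWV metrics fall into the "good" ranges. The thresholds are Google's publicly documented Core Web Vitals boundaries (2.5s LCP, 100ms FID, 0.1 CLS), not values from the leak, and the function itself is a hypothetical sketch of how such signals could be consumed.

```python
# Publicly documented "good" Core Web Vitals boundaries (not from the leak).
CWV_GOOD = {"lcp_ms": 2500, "fid_ms": 100, "cls": 0.1}

def cwv_passes(lcp_ms, fid_ms, cls):
    """Return True if all three Core Web Vitals are in the 'good' range."""
    return (lcp_ms <= CWV_GOOD["lcp_ms"]
            and fid_ms <= CWV_GOOD["fid_ms"]
            and cls <= CWV_GOOD["cls"])
```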
Anchors
This group of modules contains attributes and functions that deal with web links.
Anchors
Contains reference anchor statistics, including:
IndexTier: there are three tiers wherein documents are categorized at the indexing stage (Alexandria): Base (highest), Zeppelins, Landfills. A part of per-doc data. Also used in the AnchorsAnchor module to assign a quality mark to a source page.
Anchor-dropped stats, including:
homepageAnchorsDropped (type: String.t, default: nil) - The total # of local homepage anchors dropped in AnchorAccumulator.
localAnchorsDropped (type: String.t, default: nil) - The total # of local non-homepage anchors dropped in AnchorAccumulator.
AnchorsAnchor
This module contains general anchor attributes. Importantly, data on so-called internal links (versus external links / backlinks) are stored in the same module, which suggests that Google makes no principal distinction between them (see the isLocal and parallelLinks attributes below).
origText (type: String.t, default: nil) - Original text, including capitalization and punctuation. Runs of whitespace are collapsed into a single space.
context2 (type: integer(), default: nil) - This is a hash of terms near the anchor. (This is a second-generation hash replacing the value stored in the 'context' field.)
sourceType (type: integer(), default: nil) - is to record the quality of the anchor's source page and is correlated with but not identical to the index tier of the source page. In the docjoins built by the indexing pipeline (Alexandria), - Anchors marked TYPE_HIGH_QUALITY are from base documents. - Anchors marked TYPE_MEDIUM_QUALITY are from documents of medium quality (roughly but not exactly supplemental tier documents). - Anchors marked TYPE_LOW_QUALITY are from documents of low quality (roughly but not exactly blackhole documents). Note that the source_type can also be used as an importance indicator of an anchor (a lower source_type value indicates a more important anchor), so it is important to enforce that TYPE_HIGH_QUALITY < TYPE_MEDIUM_QUALITY < TYPE_LOW_QUALITY.
isLocal (type: boolean(), default: nil) - The bit ~roughly~ indicates whether an anchor's source and target pages are on the same domain. Note: this plays no role in determining whether an anchor is onsite, ondomain, or offdomain in mustang (i.e., the bit above).
fullLeftContext (type: list(String.t), default: nil) - The full context. These are not written out in the linklogs.
fullRightContext (type: list(String.t), default: nil) -
expired (type: boolean(), default: nil) - true iff exp domain
locality (type: integer(), default: nil) - For ranking purposes, the quality of an anchor is measured by its "locality" and "bucket". See quality/anchors/definitions.h for more information.
parallelLinks (type: integer(), default: nil) - The number of additional links from the same source page to the same target domain. Not populated if is_local is true.
lastUpdateTimestamp (type: integer(), default: nil) - Used for history and freshness tracking - the timestamp this anchor is updated in indexing.
encodedNewsAnchorData: encoded data containing information about newsiness of anchor. Populated only if anchor is classified as coming from a newsy, high quality site. Encoded data for anchor sources are being stored in googledata/quality/freshness/news_anchors/encoded_news_anchors_data.txt Scores are being computed with quality/freshness/news_anchors/ routines.
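The sourceType ordering described above (a lower value indicates a more important anchor) can be sketched with a simple enum. The numeric values here are my own placeholders; only the required ordering TYPE_HIGH_QUALITY < TYPE_MEDIUM_QUALITY < TYPE_LOW_QUALITY comes from the leaked documentation.

```python
from enum import IntEnum

class AnchorSourceType(IntEnum):
    # Lower numeric value = more important anchor, matching the leak's note
    # that TYPE_HIGH_QUALITY < TYPE_MEDIUM_QUALITY < TYPE_LOW_QUALITY.
    TYPE_HIGH_QUALITY = 1    # roughly: base-tier source documents
    TYPE_MEDIUM_QUALITY = 2  # roughly: supplemental-tier documents
    TYPE_LOW_QUALITY = 3     # roughly: "blackhole" documents

def more_important(a: AnchorSourceType, b: AnchorSourceType) -> bool:
    """True if anchor source type `a` is more important than `b`."""
    return a < b
```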
AnchorsAnchorSource
This module contains attributes of the source document for the link, i.e. the document that sets a link to us.
homePageInfo: Information about whether the source page is a home page. It can be one of the enum values defined in PerDocData::HomePageInfo (NOT_HOMEPAGE, NOT_TRUSTED, PARTIALLY_TRUSTED, and FULLY_TRUSTED). Significance: Google decides how to value a link based on how much it trusts the homepage.
nsr: This NSR value has range [0,1000] and is the original value [0.0,1.0] multiplied by 1000 rounded to an integer.
outsites (type: integer(), default: nil) - approx num of pointed-to sites
pageTags (type: list(integer()), default: nil) - Page tags are described by enum PageTag in PerDocData. Page tags are used in anchors to identify properties of the linking page. These are DEPRECATED: in the future, use link_tags instead. DEPRECATED
pagerank (type: integer(), default: nil) - uint16 scale
pagerankNs (type: integer(), default: nil) - uint16 scale
spamrank (type: integer(), default: nil) - uint16 scale
spamscore1 (type: integer(), default: nil) - deprecated, to be removed after October 20. 0-127 scale
spamscore2 (type: integer(), default: nil) - 0-127 scale
As a part of the Indexing Docjoiner function, which arguably is responsible for keeping the most relevant info on the versions of a document in the Google index (Alexandria), I can find multiple references to anchor spam data, including:
AnchorPhrase
A whole series of metrics about the identification of spam anchor text. Found 34 attributes, including:
phraseAnchorSpamCount: How many spam phrases found in the anchors among unique domains. Also phraseAnchorSpamProcessed: Total number of observed anchors.
phraseAnchorSpamDays: Over how many days 80% of these phrases were discovered. This is to signify spam velocity. Also phraseAnchorSpamRate: Average daily rate of spam anchor discovery. phraseAnchorSpamEnd: Time when anchor spam spike ended, with padding. phraseAnchorSpamStart: Time when anchor spam spike started, with padding.
phraseAnchorSpamDemoted: Total number of demoted anchors.
phraseAnchorSpamPenalty: Combined penalty for anchor demotion.
Significance: signals above identify spike of spammy anchor phrases. Anchors created during the spike are tagged with LINK_SPAM_PHRASE_SPIKE.
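The spike idea behind phraseAnchorSpamStart/End/Rate can be illustrated with a toy detector over daily counts of newly discovered spammy anchor phrases. The windowing logic and threshold are invented for illustration; the leak only names the attributes, not the algorithm.

```python
def detect_anchor_spam_spike(daily_spam_counts, rate_threshold=10.0):
    """Toy spike detector over daily counts of spammy anchor phrases.

    Returns (start_day, end_day, avg_rate) for the first window whose daily
    rate of new spam anchors exceeds `rate_threshold`, else None. This only
    mirrors the *idea* behind phraseAnchorSpamStart/End/Rate.
    """
    start = None
    for day, count in enumerate(daily_spam_counts):
        if count >= rate_threshold and start is None:
            start = day                      # spike begins
        elif count < rate_threshold and start is not None:
            window = daily_spam_counts[start:day]
            return start, day - 1, sum(window) / len(window)
    if start is not None:                    # spike ran to the end of the data
        window = daily_spam_counts[start:]
        return start, len(daily_spam_counts) - 1, sum(window) / len(window)
    return None
```

Anchors landing inside the returned window would, per the section above, be tagged with LINK_SPAM_PHRASE_SPIKE.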
AnchorTrustedInfo
Arguably, Google uses a trusted-source database (another seed set) and compares anchor data against it [where relevant].
matchedScore (type: number(), default: nil) - Difference in KL-divergence from spam and non-spam anchors. Values > 0 indicate that anchor text from this trusted source is similar to anchors classified as spam, which means that otherwise-spammy anchors may be legitimate.
matchedScoreInfo (type: list(String.t), default: nil) - Detailed debug information about computation of trusted anchors match. Populated only when --anchorspam_penalizer_debug=true
phrasesScore (type: number(), default: nil) - Count of anchors classified as spam using anchor text.
trustedScore (type: number(), default: nil) - Fraction of pages with newsy anchors on the site, >0 for trusted sites.
Spam brain
Spambrain: a function (component) that detects spam. Arguably, SpamBrain is Google's AI-powered spam-fighting system, which is crucial for demoting or removing spammy content from search results.
Has the following modules:
SpamBrainData,
SpamBrainScore.
Also participates in the following modules:
IndexingDocjoinerDataVersion:
spambrainDomainSitechunkData,
spamBrainSpamBrainPageClassifierAnnotation
spamBrainSpamBrainData
QualityNsrNsrData
spambrainLavcScores
Vocabulary of Terms
Asteroid Belt (aka Quality Orbit)
Quality Orbit is a database of Intents and Scores as well as confidence thresholds and [possibly] rules for their application. Used in modules named QualityOrbitAsteroidBelt. There are references to an Intent Catalogue in RepositoryWebrefWebrefMustangAttachment. There is an Orbit for Images and for Documents, and it is seen from the source that Google has an Orbit team in place. Attributes include:
intents (type: list(String.t), default: nil) - The 'intents' and 'scores' fields are stored as parallel lists for compactness. The 'scores' field should not be accessed directly, but instead through the functions in document_intent_scores_utils.
scores (type: list(integer()), default: nil) - The intent scores, scaled to integers between 0 and 100 for compactness.
Asteroid Belt Intents are a part of Per-doc data: asteroidBeltIntents (type: GoogleApi.ContentWarehouse.V1.Model.QualityOrbitAsteroidBeltDocumentIntentScores.t, default: nil) - For indexing Asteroid Belt intent scores. See go/asteroid-belt for details.
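The parallel 'intents'/'scores' lists described above can be read with a small accessor. This is a stand-in for the role the leak attributes to document_intent_scores_utils; the real helpers are not public, so this function is entirely my own sketch.

```python
def get_intent_score(intents, scores, intent):
    """Read the 'intents'/'scores' parallel lists (scores scaled 0-100).

    Returns the score for `intent` rescaled back to [0, 1], or None if the
    intent is not present. Hypothetical accessor, not Google's.
    """
    if len(intents) != len(scores):
        raise ValueError("parallel lists must have equal length")
    for name, score in zip(intents, scores):
        if name == intent:
            return score / 100.0  # back to a [0, 1] confidence
    return None
```

Storing intents and scores as two parallel primitive lists (rather than a list of objects) is a common compactness trick in serialized per-doc data, which matches the leak's "for compactness" note.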
Blockbert
BlockBERT, a lightweight and efficient block-wise BERT model for better modeling long-distance dependencies. Featured in:
Brain
A common term for components or systems used to analyse signals. Occurrences: SpamBrain, RankBrain, Brainloc.
Craps (used in NavBoost)
Originally, a dice game played with two dice. This reading is supported by the use of "dice" vocabulary, e.g. references to a "dice-team". There is another reading: the acronym CRAPS arguably stands for "Clicks and Results Prediction System". Possibly, both readings point in the same direction.
Craps signals encompass clicks and impressions (supported by "Click / impression signals for craps" in the QualityNavboostCrapsCrapsData module) and, possibly, Google SERP position. Craps signals work on two levels: URL and Pattern (a set of URLs), supported by "crapsNewUrlSignals (type: String.t, default: nil) - For craps[url|pattern]_signals, please ...". NavBoost is a re-ranking module that uses a pair-of-dice metaphor and logic. Presumably, the module uses clicks and impressions (and their proportions) as a "winning" dice combination per specific position in the SERP: if a document gets a better combo for a position than others do, it gets a boost.
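The "better combo per position" idea can be sketched as a position-normalized CTR comparison. Everything here is speculative: the baseline expected-CTR-per-position table and the simple subtraction are my own illustration of the concept, not NavBoost's actual math.

```python
def craps_position_boost(clicks, impressions, expected_ctr_by_pos, position):
    """Toy NavBoost-style adjustment from click/impression signals.

    If a document's observed CTR at a SERP position beats the (hypothetical)
    expected CTR for that position, it gets a positive boost; if it
    underperforms, a demotion. `expected_ctr_by_pos` is an invented baseline.
    """
    if impressions == 0:
        return 0.0  # no data, no adjustment
    observed = clicks / impressions
    expected = expected_ctr_by_pos[position]
    return observed - expected  # > 0 boost, < 0 demotion
```

The key point the sketch captures: raw clicks alone mean little; they must be judged against what a document at that position would normally earn.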
Chrome data
Chrome occurs 34 times in the document; Chrome-related attributes suggest that Google uses data from its Chrome browser to influence search results. Used in:
deciding Top URLs - for Sitelinks creation,
Site-level Chrome views - used in Quality Scoring via NSR
Unique Chrome views - for video content.
Firefly
Firefly is a symbol of hope. FireflySiteSignal is a part of per-doc data and belongs to the QualityCopia modules; it is also used in IndexingDocjoinerDataVersion. As a part of per-doc data, it contains site signal information for the Firefly ranking change. As a part of the QualityCopia group of modules it contains dailyClicks, dailyGoodClicks, etc., and totalImpressions. FireflySiteSignal may be made to reward (or demote) documents with rising performance.
QualityCopiaFireflySiteSignal:
numOfArticlesByPeriods: number of articles (lattice article score is 0.8 or more) sliced by 30 days (num_of_articles_by_periods[0] is the newest). Implies that it might pay off if you publish new articles regularly.
Fringe Queries
Google has a function for detecting fringe queries with data stored in Per-Doc Data. Arguably, this is to define a probability that a [search] query has fringe or peripheral meaning in relation to dominant meaning, given this document's site is [already] in the result set.
encodedDocumentFringeVulnerability (type: integer(), default: nil) - An estimate of the vulnerability of this doc to show fringe content, based on the context around the document. Can be interpreted as a 'safe' QScore threshold to use (see go/doc-fringe-vulnerability for more info).
encodedEntityPriorScore (type: integer(), default: nil) - Highest entity prior seen for document's Headline and SingleTopic entities (see go/topicality-score for definitions of entity topicalities). Represents probability that a query is fringe, given that the entity is in the result set with topicality >= Headline.
encodedFringePriorScore (type: integer(), default: nil) - Probability that a query is fringe, given this document is in the result set. Scores scaled to integers between 0 and 1000 for compactness.
Seen in context with the following entities as part of ClassifierPornQueryClassifierOutput: csai, fringe, medical, minor, offensive, porn, spoof, violence, vulgar.
Also, it is a part of RepositoryWebrefNameScores (see below).
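The leak says fringe scores are "scaled to integers between 0 and 1000 for compactness". The scaling itself is trivial to reproduce; the function names below are mine, but the 0-1000 integer encoding is exactly what the attribute description states.

```python
def encode_fringe_score(probability):
    """Scale a [0.0, 1.0] fringe probability to the 0-1000 integer range
    the leak describes for encodedFringePriorScore."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    return round(probability * 1000)

def decode_fringe_score(encoded):
    """Recover the approximate probability from the stored integer."""
    return encoded / 1000.0
```

Storing a bounded float as a small integer halves (or better) its serialized size at the cost of three decimal digits of precision, which is plenty for a prior.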
Goldmine
Alternative Titles [& Snippets] Generator (supported by goldmineOtherBoostFeatureCount (type: integer(), default: nil) - The number of BoostFeatures present in AlternativeTitlesGeneator but not populated above).
Goldmine is used many times in the document as belonging to the Quality Preview module. Arguably, it acts as a preparation stage in Ascorer: scoring candidates for Google Search. Might include: RankBrain, snippet scoring.
Internal boost feature signals used to compute goldmine_page_score. They are exposed only for debugging purpose.
See Blockbert in Vocabulary Terms.
Has Site Info
Whether a title contains site information. Module name: QualityPreview.
Index Tier
There are three tiers wherein documents are categorized at the indexing stage (Alexandria): Base (highest), Zeppelins, Landfills. A part of per-doc data. Also used in the AnchorsAnchor module to assign a quality mark to a source page.
Significance: Googleβs index is stratified into tiers where the most important, regularly updated, and accessed content is stored in flash memory. Less important content is stored on solid state drives, and irregularly updated content is stored on standard hard drives.
Implications: the higher the tier, the more valuable the link. Pages that are considered "fresh" are also considered high quality. Suffice it to say, you want your links to come from pages that are either fresh or otherwise featured in the top tier. This partially explains why getting links from highly ranking pages and from news pages yields better ranking performance.
Newsy
Refers to news-like, high-quality content (cf. anchors "coming from a newsy, high quality site").
Ocean
Model and data for indexing book pages. Has the following modules:
OceanDocTag:
pagerank (type: float(), default: nil) - the page rank value of the book page.
contentType (type: integer(), default: nil) - The content type of the document. See BoundVolumeSource::ContentType in ocean/data/volume_types.protodevel for possible values.
OceanPerDocData
Ranklab
A collection of data used for producing ranking signals (scoring) and for selecting the Title and Snippet of a document. Used for scoring candidate Titles and Snippets as well as for recording their features. Seen in: QualityPreview modules.
PageRank (aka Pagerank NS, Nearest Seeds)
Google arguably has a collection of trusted articles on all topics - the gold standard of trust. All selected links form a link graph, and the rank of each page is calculated by its distance from the trusted documents using a standard graph algorithm; this is called the Nearest Seeds method. For example, if a trusted article from The New York Times links to an article on site X, an article from site X links to an article on site Y, and an article on site Y links to wlw, the distance will be 3. Distance in graphs is calculated not by nodes but by links (edges). The smaller the distance, the better for this indicator. The gold standard is also mentioned in previous leaks; Bill Slawski described it here: https://www.seobythesea.com/2018/04/pagerank-updated/ Module: AnchorsAnchorSource (attributes of the source document for the link). Also: pagerankWeight - weight to be stored in linkmaps for pageranker; homepagePagerankNs - the PageRank of the homepage of the site (module: PerDocData). Every document has its homepage PageRank (the Nearest Seed version) associated with it, likely used as a proxy for new pages until they capture their own PageRank.
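The Nearest Seeds distance described above is a textbook shortest-path computation over the link graph. A minimal sketch (my own breadth-first search, not Google's implementation), reproducing the NYT → X → Y → wlw example:

```python
from collections import deque

def nearest_seed_distance(link_graph, seeds, target):
    """Shortest link distance (in edges) from any trusted seed page to `target`.

    `link_graph` maps page -> list of pages it links to. A smaller distance
    implies more trust, per the Nearest Seeds reading of pagerankNs.
    Plain BFS; returns None if `target` is unreachable from every seed.
    """
    queue = deque((s, 0) for s in seeds)
    seen = set(seeds)
    while queue:
        page, dist = queue.popleft()
        if page == target:
            return dist
        for linked in link_graph.get(page, []):
            if linked not in seen:
                seen.add(linked)
                queue.append((linked, dist + 1))
    return None

# The example from the text: NYT -> siteX -> siteY -> wlw gives distance 3.
graph = {"nytimes": ["siteX"], "siteX": ["siteY"], "siteY": ["wlw"]}
```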
Panda
In the real world, pandas play an important role in the forest ecosystem where they live: seeds and plant matter collect on their fur and are deposited as they move throughout their habitat, and they also climb trees and swim, which further helps disperse the seeds. The Panda component (previously known as the Panda update) suggests the following: where a site has a lot of inbound links but a small number of discovered keywords and clicks, the site shall be considered Unhelpful. The bottom line: you need to drive more successful clicks using a broader set of queries and earn more link diversity if you want to continue to rank. Conceptually, it makes sense, because a very strong piece of content will do that. Entries in the DB: 139.
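The "many links, few queries and clicks" pattern can be sketched as a simple heuristic. Every threshold below is invented purely for illustration; the section only describes the shape of the signal, not its math.

```python
def looks_unhelpful(inbound_links, distinct_query_count, good_clicks,
                    link_to_click_ratio=50.0, min_queries=10):
    """Toy version of the Panda-style pattern described above: many inbound
    links but few discovered queries and few good clicks -> likely unhelpful.
    All thresholds are hypothetical."""
    if distinct_query_count >= min_queries and good_clicks > 0:
        return False  # healthy query/click footprint
    # Links wildly out of proportion to engagement is the suspicious pattern.
    return inbound_links / max(good_clicks, 1) > link_to_click_ratio
```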
Quality Nsr
NSR most probably means Neural Semantic Retrieval. Introduced in 2018 and used to better understand how queries relate to pages. "It looks at an entire query or page rather than just keywords, developing a better understanding of the underlying concepts represented in them." Called a critical part of document retrieval. Related attributes:
siteScore: site-level aggregated keto score.
The keto score is arguably the score of the proportion of "meat" (protein) documents versus "chard" (vegetable) documents; see below.
titlematchScore: titlematch score of the site, a signal that tells how well titles match user queries.
chardScore: site-level chard score, a site-quality predictor based on content. Also chardScoreVariance: site-level chard variance for all pages of a site.
tofu, or site-level tofu score: a site-quality predictor based on content.
siteLinkOut: aggregated value of URL link-out scores for this sitechunk.
clutterScore: delta site-level signal in Q* penalizing sites with a large number of distracting/annoying resources loaded by the site.
localityScore: locality score of the site, i.e. the locality component of the LocalAuthority signal.
predictedDefaultNsr: arguably, for new documents, the predicted default NSR score computed in Goldmine via the NSR default predictor.
siteAuthority: converted from quality_nsr.SiteAuthority, applied in Qstar.
nsrConfidence: same as nsr_variance; implies the level of confidence in NSR data calculations.
Has interesting specific attributes connected to Covid and elections:
nsrIsCovidLocalAuthority
nsrIsElectionAuthority
Repository Web References
RepositoryWebref group of modules scores and stores semantic data that are used as per-doc data for serving into Mustang. Includes 91 attributes (lines 7212-7303). Includes:
completeWorldVolume (type: number(), default: nil) - Approximates volume of this name including entities we don't have in our set. This is useful to estimate the size of the "open world". For example, this field can be equal to the total number of clicks for the query. Note that for now, we ignore the number of clicks but just use the idf if available.
contextFringeScore (type: number(), default: nil) - The fringe score in [0, 1] for this context name. This is only populated for full-query context names.
contextWeight (type: number(), default: nil) - The amount of evidence available for this context name.
idfScore (type: number(), default: nil) - The idf of this name.
openWorldVolumeModifier (type: number(), default: nil) - 'Raw' open world is computed based on ngram count data. For some names however, we modify the open world based on entities associated with them. This field contains the additive open world modifier. open_world_volume = raw open world + open_world_volume_modifier
totalScore (type: number(), default: nil) - The total score of that name/source. It describes how much data we have for that name/source. It can for example be the sum of all the entity scores for this name.
Scamness
Scam model score. Used as one of the web-page quality Qstar signals. Implications: might be used in fighting spam.
Scorers
Collectively, scoring systems (supported by "Versioned TopicEmbeddings data to be populated later into superroot / used directly in scorers"). Related: Ascorer. Ascorer seems to be what SEOs refer to as the system that collects ranking signals and defines the final score used for ranking. Referred to as the core of the Google algorithm used for information retrieval.
Site Authority
Also referred to as a PageRank variant, a compressed quality signal used in preliminary scoring [of search results]. Exists on the domain (site-wide) level. The homepage signal applies to all pages.
Source Type
See also Index Tier. Is to record the quality of the anchorβs source page and is correlated with but not identical to the index tier of the source page. In the docjoins built by the indexing pipeline (Alexandria), β Anchors marked TYPE_HIGH_QUALITY are from base documents. β Anchors marked TYPE_MEDIUM_QUALITY are from documents of medium quality (roughly but not exactly supplemental tier documents). β Anchors marked TYPE_LOW_QUALITY are from documents of low quality (roughly but not exactly blackhole documents).
Tundra project
URL history
URL change history for this doc (see crawler/changerate/changerate.proto for details). Note: if a doc has more than 20 changes, we only keep the last 20 changes here to avoid adding too much data to its docjoin. Module: IndexingInfo. Module description: contains information *mostly* used within indexing (e.g. not used for building the production serving shards). Most of this data is generated only in Alexandria, although there are exceptions. Implications: this should give a sense of how many times you need to change pages and have them indexed to get a "clean slate" in Google.
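The "keep only the last 20 changes" behavior is a classic bounded history. A minimal sketch (class and field names are mine; only the cap of 20 comes from the leak text):

```python
from collections import deque

class UrlChangeHistory:
    """Bounded change log mirroring the leak's note that only the last 20
    URL changes are kept per doc. Older entries are silently dropped."""
    MAX_CHANGES = 20

    def __init__(self):
        # deque with maxlen discards the oldest entry on overflow.
        self._changes = deque(maxlen=self.MAX_CHANGES)

    def record(self, timestamp, change_fraction):
        """Append one observed change (hypothetical fields)."""
        self._changes.append((timestamp, change_fraction))

    @property
    def changes(self):
        return list(self._changes)
```

The practical implication from the section holds in the sketch too: after 20 further indexed changes, the earliest history entries are gone.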
work in progress, to be expanded
Last updated