Deep Funding.md (+3, -27)
···
# Deep Funding

-The goal of [Deep Funding](https://deepfunding.org/) is to develop a system that can allocate resources to public goods with a level of accuracy, fairness, and open access that rivals how private goods are funded by markets, ensuring that high-quality open-source projects can be sustained. Traditional price signals don't exist, so we need "artificial markets" that can simulate the information aggregation properties of real markets while being resistant to the unique failure modes of public goods funding.
+The goal of [Deep Funding](https://deepfunding.org/) is to develop a system that can allocate resources to public goods with a level of accuracy, fairness, and open access that rivals how private goods are funded by markets, ensuring that high-quality open-source projects can be sustained. Traditional price signals don't exist, so we need "artificial markets" that can simulate the information aggregation properties of real markets while being resistant to the unique failure modes of public goods funding. [Deep Funding is an Impact Evaluator](https://hackmd.io/@dwddao/HypnqpQKke).

In Deep Funding, multiple mechanisms (involving data, mechanism design, and open source) work together. Each layer can be optimized and iterated independently.
···
## Ideas

-### Alternative Approach
-
-Given the current open problems, this is an interesting alternative way ([inspired by RLHF](https://www.itai-shapira.com/pdfs/pairwise_calibrated_rewards_for_pluralistic_alignment.pdf)) of running a Deep Funding "round". The gist of the idea is to **use only a few significant data points to choose and reward the final models** instead of deriving weights for the entire set of children/dependencies of a project. Resolve the market with only a few, well-tested pairs!
-
-Like in the current setup, a DAG of projects is needed. The organizers publish it, along with an encoded list of projects that will be evaluated by jurors. Participants can only see the DAG; the "evaluated projects" will be revealed at the end.
-
-Once participants have worked on their models and sent/traded their predictions, the "evaluated project" list is revealed and only those projects are used to evaluate the predicted weights. The best strategy is to price all items truthfully. The open question: how can we evaluate only a few projects when the jurors' comparisons don't form a graph connected to the rest of the projects?
-
-Since we don't have a global view (no interconnected graph), we need comparative and scale-free metrics. Metrics like the [Brier score](https://en.wikipedia.org/wiki/Brier_score) or methods like [Bradley-Terry](https://www.itai-shapira.com/pdfs/pairwise_calibrated_rewards_for_pluralistic_alignment.pdf) can be used to evaluate any model's or trader's weights ([in that case you're fitting just a single global scale or temperature parameter to minimize negative log-likelihood](https://apxml.com/courses/rlhf-reinforcement-learning-human-feedback/chapter-3-reward-modeling-human-preferences/reward-model-calibration)). Don't give the mechanism explicit instructions. Just give it a goal (and compute rewards) and let it figure out the best strategy.
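To make the model-selection step concrete, here is a minimal sketch (not the round's actual scoring code) of how a submitted weight vector could be scored against the revealed juror pairs, assuming the weights act as Bradley-Terry strengths; the function names and toy data are illustrative.

```python
import numpy as np

def pairwise_prob(weights, i, j):
    """Bradley-Terry style probability that project i beats project j,
    implied by a submitted weight vector."""
    return weights[i] / (weights[i] + weights[j])

def score_model(weights, juror_pairs):
    """Mean negative log-likelihood and Brier score of a weight vector
    on the revealed juror pairs, given as (winner, loser) index tuples."""
    probs = np.array([pairwise_prob(weights, w, l) for w, l in juror_pairs])
    nll = -np.log(np.clip(probs, 1e-9, 1.0)).mean()
    brier = ((1.0 - probs) ** 2).mean()  # the "winner beats loser" outcome is always 1
    return nll, brier

# Toy example: two submitted weight vectors over four projects, three juror verdicts.
juror_pairs = [(0, 1), (0, 2), (2, 3)]          # (winner, loser)
model_a = np.array([0.5, 0.2, 0.2, 0.1])
model_b = np.array([0.25, 0.25, 0.25, 0.25])
print(score_model(model_a, juror_pairs))        # lower is better on both metrics
print(score_model(model_b, juror_pairs))
```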
-
-Once the best model is chosen (the one that comes closest to the chosen subset of pairwise comparisons), the same pairwise comparisons can be used [to adjust the scale of the weight distribution](https://proceedings.mlr.press/v70/guo17a/guo17a.pdf). That means the market resolution uses only the subset (for payouts to traders), while the funding distribution uses the model's global ranking with its probabilities calibrated to the subset via a single scalar _a_ that pins the entire slate to the scale that was verified by real judgments. The **jurors'** pairwise comparisons can even be "merged" with the best model to incorporate all the data.
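A sketch of the single-scalar calibration described above, assuming the power-scaling form P(i beats j) = w_i^a / (w_i^a + w_j^a); the grid search and toy numbers are just one dependency-free way to fit _a_.

```python
import numpy as np

def nll(a, weights, juror_pairs):
    """Negative log-likelihood of juror verdicts under the single-scalar
    calibration P(i beats j) = w_i^a / (w_i^a + w_j^a)."""
    w = weights ** a
    probs = np.array([w[i] / (w[i] + w[j]) for i, j in juror_pairs])
    return -np.log(np.clip(probs, 1e-9, 1.0)).sum()

def calibrate(weights, juror_pairs, grid=np.linspace(0.05, 5.0, 200)):
    """Pick the scalar `a` that best explains the juror pairs (grid search
    keeps it dependency-free), then rescale the full slate with it."""
    a_best = min(grid, key=lambda a: nll(a, weights, juror_pairs))
    calibrated = weights ** a_best
    return a_best, calibrated / calibrated.sum()   # the funding distribution

best_model = np.array([0.5, 0.2, 0.2, 0.1])        # illustrative weights
juror_pairs = [(0, 1), (0, 2), (2, 3)]             # (winner, loser)
a, funding = calibrate(best_model, juror_pairs)
print(a, funding)    # a < 1 flattens the slate, a > 1 sharpens it
```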
-
-Basically, there are two steps: first select the best model, then rescale its weights using the jury's pairwise comparisons. With far fewer comparisons we can reach a better final weight distribution, since we get a more significant graph of relative weights and we also use the golden juror pairs to adjust the scale.
-
-The organizers' task is to [gather enough pairwise comparisons to make this subset statistically significant](https://arxiv.org/pdf/1505.01462), which is much simpler and more feasible than doing it for the entire dependency set of a node (which can be 128 items). With [efficiently sampled pairs](https://arxiv.org/abs/2302.13507) ([or approximate rankings](https://proceedings.mlr.press/v84/heckel18a.html)), far fewer comparisons are needed for the subset.
-
-Once the competition ends, extra comparisons could be gathered for projects that have high variance, or via another trigger mechanism.
-
-### More Ideas
-
-- Fix weight distributions (Zipf's law) and make modelers focus on predicting the rank. Pick the model that aligns the most with the pairwise data collected.
-- Set a consensus over which meta-mechanism is used to evaluate weights (e.g. Brier score). Judge/rank mechanisms/models solely on their performance against the rigorous pre-built eval set. No subjective opinions, just a leaderboard of the most aligned weight distributions.
+- An alternative is to take an [RLHF-inspired](https://www.itai-shapira.com/pdfs/pairwise_calibrated_rewards_for_pluralistic_alignment.pdf) approach: **use only a few significant data points to choose and reward the final models** instead of deriving weights for the entire set of children/dependencies of a project. Resolve the market with only a few, well-tested pairs!
+- Fix weight distributions (Zipf's law) and make modelers/jurors focus on predicting the rank. Pick the model that aligns the most with the pairwise data collected (see the Zipf sketch after this list).
- Win rates can be derived from pairwise comparisons.
- Lean on the [[Pairwise Comparisons]] playbook (binary questions over intensity, active sampling, filtering noisy raters) for any human labeling.
-- Do some post-processing to the weights:
-  - Report accuracy/Brier and use a paired bootstrap to see if the gap between models is statistically meaningful (see the sketch after this list).
-  - If gaps are not statistically meaningful, bucket rewards (using Zipf's law) so it feels fair.
-- How would things look if they were [Bayesian Bradley-Terry](https://erichorvitz.com/crowd_pairwise.pdf) instead of [classic Bradley-Terry](https://gwern.net/resorter)? Since comparisons are noisy and we have unreliable jurors, can we [compute distributions instead of "skills"](https://github.com/max-niederman/fullrank)?
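A sketch of the paired-bootstrap check mentioned in the list above, assuming each model's per-pair Brier errors on the same juror pairs are available; the error values are made up.

```python
import numpy as np

def paired_bootstrap(errors_a, errors_b, n_boot=10_000, seed=0):
    """Paired bootstrap over per-pair Brier errors of two models.
    Returns the observed gap (B minus A) and the fraction of resamples
    where model A is NOT better, a rough one-sided p-value."""
    rng = np.random.default_rng(seed)
    errors_a, errors_b = np.asarray(errors_a), np.asarray(errors_b)
    n = len(errors_a)
    observed_gap = errors_b.mean() - errors_a.mean()
    idx = rng.integers(0, n, size=(n_boot, n))      # resample the same pairs for both models
    gaps = errors_b[idx].mean(axis=1) - errors_a[idx].mean(axis=1)
    p_not_better = (gaps <= 0).mean()
    return observed_gap, p_not_better

# Toy per-pair Brier errors on the same juror pairs (lower is better).
errs_a = [0.05, 0.10, 0.30, 0.02, 0.20]
errs_b = [0.12, 0.15, 0.28, 0.10, 0.25]
print(paired_bootstrap(errs_a, errs_b))
```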
- Instead of one canonical graph, allow different stakeholder groups (developers, funders, users) to maintain their own weight overlays on the same edge structure. Aggregate these views using quadratic or other mechanisms.
- If there is a plurality of these "dependency graphs" (or just different sets of weights), the funding organization can choose which one to use! The curators gain a % of the money for their service. This creates a market-like mechanism that incentivizes useful curation.
- Let the dependents set their own weight percentage if they're around.
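A toy sketch of the fixed-Zipf idea from the list above: the payout shape is decided up front and modelers only compete on the ordering; the exponent and project names are illustrative.

```python
import numpy as np

def zipf_weights(ranked_projects, s=1.0):
    """Turn a predicted ranking into a fixed Zipf-law weight distribution:
    the k-th ranked project gets weight proportional to 1 / k^s."""
    ranks = np.arange(1, len(ranked_projects) + 1)
    w = 1.0 / ranks ** s
    w /= w.sum()
    return dict(zip(ranked_projects, w))

# Modelers only predict the order; the shape of the payout is fixed in advance.
print(zipf_weights(["ethers.js", "hardhat", "chai", "dotenv"]))
```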
Impact Evaluators.md (+2, -11)
···
Impact Evaluators are frameworks for [[Coordination|coordinating]] work and aligning [[Incentives]] in complex [[Systems]]. They provide mechanisms for retrospectively evaluating and rewarding contributions based on impact, helping solve coordination problems in [[Public Goods Funding]].

-It's hard to do [[Public Goods Funding]], open-source software, research, etc. that don't have a clear, immediate financial return, especially high-risk/high-reward projects. Traditional funding often fails here. Instead of just giving money upfront (prospectively), Impact Evaluators create systems that look back at what work was actually done and what impact it actually had (retrospectively). It's much easier to judge the impact in a retrospective way!
+It's hard to do [[Public Goods Funding]], open-source software, research, etc. that don't have a clear, immediate financial return, especially high-risk/high-reward projects. Traditional funding often fails here. Instead of just giving money upfront (prospectively), Impact Evaluators create systems that look back at what work was actually done and what impact it actually had (retrospectively). **[It's much easier to judge the impact in a retrospective way](https://medium.com/ethereum-optimism/retroactive-public-goods-funding-33c9b7d00f0c)**!
- The extent to which an intervention is _causally responsible_ for a specific outcome (intended or unintended) is hard to figure out. There are many classic approaches: Theory of Change, data analysis, ML, ...
- The goal is to **create a system with strong [[Incentives]] for people/teams to work on valuable, uncertain things** by distributing a reward according to the demonstrable impact.
···
- People only reveal their true opinions after seeing the result (you need to show people something and iterate based on their reactions in order to build something they actually want).
- **Exploration vs Exploitation**. IEs are optimization processes that tend to exploit (more impact, more reward). This ends up in a monopoly (100% exploitation). You probably want to always keep some exploration (see the sketch after this list).
- [IEs need to show how the solution is produced by the interactions of people each of whom possesses only partial knowledge](https://news.ycombinator.com/item?id=44232461).
+- Set a consensus over which meta-mechanism is used to evaluate weights (e.g. Brier score). Judge/rank mechanisms/models solely on their performance against a rigorous pre-built eval set. No subjective opinions, just a leaderboard of the most aligned weight distributions.
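A toy sketch of the exploration/exploitation point above: reserve a fixed share of the budget that is spread evenly regardless of measured impact, so new entrants are never starved; the 10% share and project names are made up.

```python
def allocate(budget, impact_scores, explore_share=0.1):
    """Split a funding budget: most of it follows measured impact
    (exploitation), while a fixed share is spread evenly across all
    projects (exploration)."""
    total_impact = sum(impact_scores.values())
    exploit_budget = budget * (1 - explore_share)
    explore_budget = budget * explore_share
    return {
        project: (exploit_budget * score / total_impact if total_impact else 0)
        + explore_budget / len(impact_scores)
        for project, score in impact_scores.items()
    }

print(allocate(100_000, {"lib-a": 80, "lib-b": 15, "new-lib": 5}))
```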

## Principles

···
- [Generalized Impact Evaluators, A year of experiments and theory](https://research.protocol.ai/blog/2023/generalized-impact-evaluators-a-year-of-experiments-and-theory/)
- [Deliberative Consensus Protocols](https://jonathanwarden.com/deliberative-consensus-protocols/)
- [Credible Neutrality](https://balajis.com/p/credible-neutrality)
-- [Quadratic Payments: A Primer](https://vitalik.eth.limo/general/2019/12/07/quadratic.html)
-- [Quadratic Funding is Not Optimal](https://jonathanwarden.com/quadratic-funding-is-not-optimal/)
-- [A Mild Critique of Quadratic Funding](https://kronosapiens.github.io/blog/2019/12/13/mild-critique-qf.html)
- [Funding impact via milestone markets](https://docs.fileverse.io/0x0D97273dee4D1010321f9eBa2e9eaB135C17D6dE/0#key=5GgcacTDy2h1QwWV9vJqGD-YzwomzuIOueMACpjghbJLxfG3ZqbWl1qDC1Le04uR)
- [Kafka Index](https://summerofprotocols.com/wp-content/uploads/2024/04/Kafka-Index-Nadia-Asparouhova-1.pdf)
- [The Unreasonable Sufficiency of Protocols](https://summerofprotocols.com/the-unreasonable-sufficiency-of-protocols-web)
-- [Good Death](https://summerofprotocols.com/research/good-death)
-- [Retroactive Public Goods Funding](https://medium.com/ethereum-optimism/retroactive-public-goods-funding-33c9b7d00f0c)
-- [The Public Goods Funding Landscape](https://splittinginfinity.substack.com/p/the-public-goods-funding-landscape)
- [Coordination, Good and Bad](https://vitalik.eth.limo/general/2020/09/11/coordination.html)
- [On Collusion](https://vitalik.eth.limo/general/2019/04/03/collusion.html)
- [Remuneration Rights](https://openrevolution.net/remuneration-rights)
···
- [Soulbinding Like a State](https://newsletter.squishy.computer/p/soulbinding-like-a-state)
- [Market Intermediaries in a Post-AGI World](https://meaningalignment.substack.com/p/market-intermediaries-a-post-agi)
- [Goodhart's Law Not Useful](https://commoncog.com/goodharts-law-not-useful/)
-- [Ten Kilograms of Chocolate](https://medium.com/@florian_32814/ten-kilograms-of-chocolate-75c4fa3492b6)
- [Bittensor's Anatomy of Incentive Mechanism](https://docs.bittensor.com/learn/anatomy-of-incentive-mechanism)
-- [Frequently Asked Questions (And Answers) About AI Evals](https://hamel.dev/blog/posts/evals-faq/)
- [Proportionally fair online allocation of public goods with predictions](https://dl.acm.org/doi/abs/10.24963/ijcai.2023/3)
- [A natural adaptive process for collective decision-making](https://onlinelibrary.wiley.com/doi/10.3982/TE5380)
- [Tournament Theory: Thirty Years of Contests and Competitions](https://www.researchgate.net/publication/275441821_Tournament_Theory_Thirty_Years_of_Contests_and_Competitions)
···
- [Asymmetry of verification and verifier's law](https://www.jasonwei.net/blog/asymmetry-of-verification-and-verifiers-law)
- [Ostrom's Common Pool Resource Management](https://earthbound.report/2018/01/15/elinor-ostroms-8-rules-for-managing-the-commons/)
- [Community Notes Note ranking algorithm](https://communitynotes.x.com/guide/en/under-the-hood/ranking-notes)
-- [Deep Funding is a Special Case of Generalized Impact Evaluators](https://hackmd.io/@dwddao/HypnqpQKke)
- [Analysing public goods games using reinforcement learning: effect of increasing group size on cooperation](https://royalsocietypublishing.org/doi/10.1098/rsos.241195)
-- [CrowdTruth 2.0: Quality Metrics for Crowdsourcing with Disagreement](https://arxiv.org/pdf/1808.06080)
- [Coevolutionary dynamics of population and institutional rewards in public goods games](https://www.sciencedirect.com/science/article/pii/S095741742302081X)
Mechanism Design.md (+6)
···
- **Reinforcement Learning for Meta-Evaluation** - Use RL to evolve evaluation mechanisms through trial and error. The system learns which evaluation approaches work best in different contexts by treating mechanism selection as a sequential decision problem.
- **Genetic Algorithms** - Evolution-based optimization for evaluation mechanisms. Breed and mutate successful evaluation strategies, allowing the system to discover novel approaches through recombination and selection pressure (see the sketch after this list).
- **Schelling Point Coordination Games** - Information elicitation mechanisms where truth naturally emerges as the coordination point. Participants are incentivized to report honestly because they expect others to do the same, making truth the natural focal point.
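A minimal sketch of what the genetic-algorithm bullet could look like in practice, assuming a "mechanism" is just a weighting over evaluation criteria scored against a small gold eval set; the criteria, fitness function, and hyperparameters are all illustrative.

```python
import random

# A candidate "mechanism" is a weighting over evaluation criteria (illustrative names).
CRITERIA = ["downloads", "contributors", "security_audits"]

def fitness(weights, gold_set):
    """Agreement between the mechanism's project scores and gold juror scores."""
    def score(project):
        return sum(w * project[c] for w, c in zip(weights, CRITERIA))
    # Negative squared error against the gold score; higher is better.
    return -sum((score(p) - gold) ** 2 for p, gold in gold_set)

def evolve(gold_set, pop_size=20, generations=50, seed=0):
    """Breed and mutate criteria weightings, keeping the half that best matches the gold set."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in CRITERIA] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda w: fitness(w, gold_set), reverse=True)
        survivors = pop[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(survivors):
            a, b = rng.sample(survivors, 2)
            # Crossover (average) plus a small Gaussian mutation.
            children.append([(x + y) / 2 + rng.gauss(0, 0.05) for x, y in zip(a, b)])
        pop = survivors + children
    return pop[0]

gold = [({"downloads": 0.9, "contributors": 0.4, "security_audits": 0.2}, 0.7),
        ({"downloads": 0.1, "contributors": 0.8, "security_audits": 0.9}, 0.6)]
print(evolve(gold))
```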
+
+## Resources
+
+- [Quadratic Payments: A Primer](https://vitalik.eth.limo/general/2019/12/07/quadratic.html)
+- [Quadratic Funding is Not Optimal](https://jonathanwarden.com/quadratic-funding-is-not-optimal/)
+- [A Mild Critique of Quadratic Funding](https://kronosapiens.github.io/blog/2019/12/13/mild-critique-qf.html)
Pairwise Comparisons.md (+6, -1)
···
- Keep the UX fast and low-friction. Suggest options, keep context in the UI, and let people expand only if they want.
- Avoid intensity questions. They are order-dependent and [require global knowledge](https://xkcd.com/883/).
- Use [active sampling](https://projecteuclid.org/journals/annals-of-statistics/volume-47/issue-6/Active-ranking-from-pairwise-comparisons-and-when-parametric-assumptions-do/10.1214/18-AOS1772.pdf)/dueling bandits to focus on informative pairs. Stop when marginal value drops.
+- With [efficiently sampled pairs](https://arxiv.org/abs/2302.13507) ([or approximate rankings](https://proceedings.mlr.press/v84/heckel18a.html)), far fewer comparisons are needed.
- [Top-k tasks](https://proceedings.mlr.press/v84/heckel18a.html) can scale collection (pick best 3 of 6) while still being convertible to pairwise data.
- Expect [noisy raters](https://arxiv.org/abs/1612.04413). Filter or reweight after the fact using heuristics or gold questions instead of overfitting to ["experts'" biases](https://link.springer.com/article/10.1007/s10618-024-01024-z).
···
- There are many aggregation/eval rules: [Bradley-Terry](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model), [Huber in log-space](https://en.wikipedia.org/wiki/Huber_loss), [Brier](https://en.wikipedia.org/wiki/Brier_score), ...
- Converting pairs into scores or rankings is standard; start with Elo/Bradley-Terry (or crowd-aware variants) before custom models (see the sketch after this list).
- Use robust methods (crowd BT, hierarchical BT, [Bayesian variants](https://erichorvitz.com/crowd_pairwise.pdf)) to correct annotator bias and uncertainty.
-- Expert jurors can be inconsistent, biased, and expensive. [Large graphs of comparisons](https://arxiv.org/pdf/1505.01462) are needed to tame variance.
+- Expert jurors can be inconsistent, biased, and expensive. [Large graphs of comparisons](https://arxiv.org/pdf/1505.01462) are needed to tame variance. You can estimate how many pairwise comparisons are needed to make a ranking significant.
- You can report accuracy/Brier by using the [bootstrap](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)).
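A sketch of the Elo/Bradley-Terry step from the list above: fit strengths with the classic minorization-maximization update, then suggest the next pair to ask about (the closest-strength pair, a naive active-sampling heuristic). The item names are illustrative.

```python
import itertools
from collections import Counter

def bradley_terry(comparisons, items, iters=100):
    """Fit Bradley-Terry strengths from (winner, loser) pairs and suggest
    the most informative next pair to collect."""
    wins = Counter(w for w, _ in comparisons)
    matches = Counter(frozenset(p) for p in comparisons)   # comparison counts per pair
    w = {i: 1.0 for i in items}
    for _ in range(iters):
        new = {}
        for i in items:
            # MM update: strength_i = wins_i / sum over i's pairs of n_ij / (w_i + w_j)
            denom = sum(n / (w[i] + w[j])
                        for pair, n in matches.items()
                        for j in pair if j != i and i in pair)
            new[i] = wins[i] / denom if denom else w[i]
        total = sum(new.values())
        w = {i: v / total for i, v in new.items()}
    # Closest strengths = most uncertain comparison, a simple active-sampling pick.
    next_pair = min(itertools.combinations(items, 2), key=lambda p: abs(w[p[0]] - w[p[1]]))
    return w, next_pair

pairs = [("react", "vue"), ("react", "svelte"), ("vue", "svelte"), ("svelte", "vue")]
print(bradley_terry(pairs, ["react", "vue", "svelte"]))
```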

## Resources
···
- [Designing a Better Judging System](https://anishathalye.com/designing-a-better-judging-system/)
- [Quadratic vs Pairwise](https://blog.zaratan.world/p/quadratic-v-pairwise)
- [An Analysis of Pairwise Preference](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3359677)
+- [CrowdTruth 2.0: Quality Metrics for Crowdsourcing with Disagreement](https://arxiv.org/pdf/1808.06080)
+- [Ten Kilograms of Chocolate](https://medium.com/@florian_32814/ten-kilograms-of-chocolate-75c4fa3492b6)
+- [Tool to sort items using Bradley-Terry](https://gwern.net/resorter)
+- [Tool to sort items using a Bayesian approach](https://github.com/max-niederman/fullrank)