Monorepo for Tangled tangled.org

[Suggestion]: Strengthening the openness, decentralisation and resiliency of Tangled records #56

Open. Opened by nel.pet 11 months ago

Back when pulls v1 was released I mentioned on IRC a few issues I saw in its supporting record lexicons. Specifically, different pull rounds only being available through CIDs and the fact that pull comments don't strong ref the pull they're commenting on. This led to me wanting to take a general look at the Tangled lexicons as a whole to see if there were any other things like that, and this issue of suggested changes is the result! (The core goal here is the same as https://tangled.sh/@tangled.sh/core/issues/29 and the like. That is: making Tangled as open a platform as possible.)

All of these were of course reviewed with general good design in mind, but also specifically with consideration for openness, decentralisation and, in a general sense, "correctness", in that records ought to be designed to allow open, third party participation. Anyone (with the resources to do so) should be able to crawl the network and build out exactly the same view of Tangled related data as the one the official AppView has, or make a third party client to interact with Tangled. Designing Tangled like this is not only crucial to keeping Tangled open but is also a key part of making Tangled follow general ecosystem guidelines and good practice, as well as allowing Tangled to develop in the future. An open system will benefit from its openness just as much as its users will.

Of course the existence of this issue raises the question of how to (if wanted) resolve the issues I lay out below. Per the AT Proto spec, "published" lexicons are only allowed to change in ways that keep both backwards and forwards compatibility. However, "unpublished" lexicons are allowed to change as much as the creator wants. On when a lexicon is considered "published" the spec states:

It can be ambiguous when a Lexicon has been published and becomes "set in stone". At a minimum, public adoption and implementation by a third party, even without explicit permission, indicates that the Lexicon has been released and should not break compatibility. A best practice is to clearly indicate in the Lexicon type name any experimental or development status. Eg, com.corp.experimental.newRecord.

This doesn't definitively settle whether or not the Tangled lexicons would be considered "published". On one hand, records using Tangled lexicons are out there and being used, created and crawled. However, this usage is currently mostly or entirely limited to the official AppView itself. By most reasonable definitions Tangled's lexicons are most likely "published" to some degree but it could be argued (which I do here) that they aren't "published" or at least not published enough to be set in stone. Especially considering Tangled's status as alpha software.

A lot of these issues are in lexicons that are very fundamental to the design of Tangled, and the official Tangled AppView still maintains effective control over all Tangled related records. This means that there is both a lot of incentive for, and the possibility of, the lexicons being updated and the AppView (or, let's be honest, some kind of migration script) being tasked with going through all the records (known to the AppView) that conform to the old lexicons and updating them to fit the new ones (based on the info that the AppView has). This could very well cause issues, and it should be considered whether there are other, better ways to handle this. Regardless, I think the importance of getting these things right early on warrants whatever solution will work best long term. Personally I think it is appropriate to more or less break spec (in an area that is a bit of a grey zone) for the sake of making sure that these core fundamentals have a good, lasting design and won't become an unfixable issue in Tangled's future. The sooner changes like these are made the better. At a certain point Tangled will become big enough that changes like these are not possible to make. Now, enough background. Time for the concrete issues I have.

sh.tangled.*

This section is great. I have some nitpicks about a few created and addedAt fields that probably should be createdAt, but otherwise great work guys!
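As a rough illustration of what a migration script could do for the timestamp naming, here is a minimal sketch. The helper name, the tuple of legacy names and the sample record (including its `$type`) are assumptions made for the example, not anything taken from the actual Tangled lexicons.

```python
# Hypothetical migration helper. Only the target name "createdAt" and the
# legacy name "created" come from the discussion; everything else is assumed.
LEGACY_TIMESTAMP_FIELDS = ("created",)  # extend with other legacy names as needed


def normalise_created_at(record: dict) -> dict:
    """Return a copy of `record` with legacy timestamp fields renamed to createdAt."""
    migrated = dict(record)
    for legacy in LEGACY_TIMESTAMP_FIELDS:
        if legacy in migrated:
            value = migrated.pop(legacy)
            # Never overwrite a createdAt that is already present.
            migrated.setdefault("createdAt", value)
    return migrated


# Placeholder record type, used purely for illustration.
old_record = {"$type": "sh.tangled.example", "created": "2025-03-01T12:00:00Z"}
new_record = normalise_created_at(old_record)
```

The input record is left untouched, so a script like this could compare old and new before writing anything back.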

sh.tangled.repo.issue.*

The overall design of this section is very good I think. The main things here actively making life harder for any third party are the issueId and commentId fields in sh.tangled.repo.issue and sh.tangled.repo.issue.comment respectively. Having the numeric IDs built into the record itself means that any third party (like a third party client, third party AppView or potentially even Tangled itself in the future, depending on how the architecture evolves) wishing to create an issue or a comment needs to have a full and somewhat authoritative view of every issue or comment that has previously been created and/or is in the middle of being created to know the appropriate value for these fields. Otherwise they risk two issues or comments made around the same time having the same ID in their ID field. Basically: potential race conditions. It would be preferable to handle the numeric IDs of issues and their comments entirely on the AppView, based on the createdAt of each record, or (in my opinion) better yet abandon the notion of numeric IDs for these things and instead reference them purely by their AT URIs (with some potential future QOL things to make it easy/possible to reference issues by ID in comments or other issues, just like you can do on GitHub with the # notation). The AT URI is the canonical reference for any record, and this would make the system more resilient to people making records with conflicting IDs, purpose-made createdAts that slot an issue in between two older ones, etc.
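To make the "numeric IDs live only on the AppView" option concrete, here is a minimal sketch of an index that hands out per-repo numbers when it first sees a record, keyed by AT URI. The class, its in-memory storage and the example URIs (using the `did:example` placeholder method) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class IssueIndex:
    """Toy in-memory index: numeric IDs are assigned here, not stored in records."""

    # repo AT URI -> {issue AT URI: numeric ID}
    _ids_by_repo: dict = field(default_factory=dict)

    def index(self, repo_uri: str, issue_uri: str) -> int:
        ids = self._ids_by_repo.setdefault(repo_uri, {})
        # Indexing the same record twice is idempotent.
        if issue_uri not in ids:
            ids[issue_uri] = len(ids) + 1
        return ids[issue_uri]


# Placeholder URIs for illustration only.
repo = "at://did:example:alice/sh.tangled.repo/core"
index = IssueIndex()
first = index.index(repo, "at://did:example:bob/sh.tangled.repo.issue/aaa")
second = index.index(repo, "at://did:example:carol/sh.tangled.repo.issue/bbb")
again = index.index(repo, "at://did:example:bob/sh.tangled.repo.issue/aaa")
```

Because only the index counts, two clients creating issues at the same moment cannot collide: neither of them ever writes a number into a record.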

Comments not having any kind of reply field could potentially be an issue. A third party consumer would be able to arrange the comments just fine by createdAt (barring the ability to slot comments in between others as mentioned above), but having replies would make crawling the network for relevant records easier for a non-archival use case (and would be resilient to people messing with createdAt). Having proper first party replies would (IMO) be a good feature in general, but that depends on the direction you want to take issue comments, so that's probably more of a design goal decision. If comments end up getting a reply field, it should definitely be a strong ref, and either do like Bluesky's replies and reference both the root and the parent, or only reference the parent.
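For illustration, a comment record with a root-and-parent reply field might look roughly like the sketch below. A strong ref pairs an AT URI with the CID of the exact record version; the `reply` field name, the choice of the issue as root, and all URIs and CIDs shown are placeholders assumed for the example.

```python
from typing import Optional


def strong_ref(uri: str, cid: str) -> dict:
    """Build a strong ref: an AT URI pinned to one exact version (CID) of a record."""
    return {"uri": uri, "cid": cid}


def make_comment(body: str, created_at: str, issue: dict,
                 parent: Optional[dict] = None) -> dict:
    """Hypothetical record shape: root is always the issue, parent is what is replied to."""
    return {
        "$type": "sh.tangled.repo.issue.comment",
        "body": body,
        "createdAt": created_at,
        # A top-level comment simply uses the issue as its parent.
        "reply": {"root": issue, "parent": parent if parent is not None else issue},
    }


# Placeholder URIs and CIDs, not real records.
issue_ref = strong_ref("at://did:example:alice/sh.tangled.repo.issue/aaa", "cid-of-issue")
first_ref = strong_ref("at://did:example:bob/sh.tangled.repo.issue.comment/bbb", "cid-of-comment")

top_level = make_comment("Looks good", "2025-03-01T12:00:00Z", issue_ref)
nested = make_comment("Agreed", "2025-03-01T12:05:00Z", issue_ref, parent=first_ref)
```

The parent-only variant would just drop `root`, at the cost of making a crawler walk the chain to find the issue.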

On a purely good design (and more subjective) note: while the inclusion of an owner field isn't necessarily problematic in any way, it is a bit redundant considering that the record will be in the owner's PDS repo. The required fields of sh.tangled.repo.issue, sh.tangled.repo.issue.comment and sh.tangled.repo.issue.state also all seem to be missing some quite important fields (stuff like body on sh.tangled.repo.issue, state on sh.tangled.repo.issue.state and honestly most fields on sh.tangled.repo.issue.comment). Removing or keeping the owner field probably doesn't matter all that much either way. Specifying a better set of required fields would probably be good though. Required fields also matter for openness: if it is possible and allowed to create records without the fields that enable crawling, third party support (at least technically) goes out the window to a certain degree.
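A small sketch of why required fields matter to a consumer: a crawler has to decide whether a record is usable at all. The required sets below are assumptions for the sake of the example (only body and state are named in the text above), not the contents of the real lexicons.

```python
# Hypothetical required-field sets per record type.
REQUIRED_FIELDS = {
    "sh.tangled.repo.issue": {"repo", "title", "body", "createdAt"},
    "sh.tangled.repo.issue.comment": {"issue", "body", "createdAt"},
    "sh.tangled.repo.issue.state": {"issue", "state"},
}


def missing_fields(record: dict) -> list:
    """Return the required fields absent from `record`, sorted for stable output."""
    required = REQUIRED_FIELDS.get(record.get("$type"), set())
    return sorted(required - set(record))


incomplete = {"$type": "sh.tangled.repo.issue.comment", "body": "Nice work"}
problems = missing_fields(incomplete)
```

A comment without a reference to its issue cannot be attached to anything by a third party, which is exactly the kind of record a stricter required list would rule out.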

sh.tangled.repo.pull.*

This section needs a bit more love IMO but is again overall good. The same issue with race conditions for the ID fields exists for the commentId and pullId fields, and my nitpicky opinions on the owner field apply here too. The real meat of this section is in how pull rounds are handled. As I mentioned above, when I first brought that up on IRC @oppili.bsky.social responded saying that rounds are handled using CIDs. Each round is a different CID of the same pull record. At face value this is fine enough: different versions (CIDs) of one record to represent different versions of one pull. However, in an openness and decentralisation sense this gets ... tricky. There are currently no APIs for enumerating the available versions of any given record. If I, as a third party (or as anyone who for whatever reason needs to figure out the rounds from scratch), want to independently figure out the rounds of a PR, I currently simply can't. (It may be possible to use sequence numbers for the firehose or the cursor for Jetstream to backfill this info to a certain extent. However, these mechanisms aren't meant to allow backfilling far into the past, only periods of time on the order of hours or days, and thus can't be relied on for this kind of long term backfill.) Even if comments get changed to include strong refs to the rounds they're commenting on, I wouldn't be able to figure out what order the rounds go in purely based on the info that's currently kept in PDSs (and rounds without any comments would be almost entirely undiscoverable). There's a lot of discussion to be had here over how best to implement this: whether to consider rounds a kind of special comment, make a specialised record type, or some third thing. To me it mostly doesn't matter as long as it's done in a way that allows for backfill by third parties.

So. That was probably ... a lot 😅. But I think it's important to consider these kinds of things. They will be very, very relevant in the future, not only for supporting things like third party clients but for any transition of the Tangled architecture to one that fits better with the general AT Proto architecture (where, among other things, the PDS starts to play an even more active role) than the current one does. Personally I think the course of action from here is to hash out the lexicons, figure out how best to migrate to them, and then start implementing the new lexicons and the migration. I think it's okay to be potentially disruptive with any migration, purely because Tangled is in alpha, so in general backwards and forwards compatibility isn't guaranteed. Basically, in this whole thing I'm assuming that the whole sh.tangled namespace can be considered experimental and up for change as of now. There is quite a bit of work here, I know. I hope to help out where I can.

oppi.li 11mo ago

Firstly, thanks for taking the time to do this deep dive!

By most reasonable definitions Tangled's lexicons are most likely "published" to some degree but it could be argued (which I do here) that they aren't "published" or at least not published enough to be set in stone. Especially considering Tangled's status as alpha software.

I agree with you here. We are not opposed to changing the lexicon format at all while we are in alpha status, and we definitely want to move in the direction of making this accessible to alternative clients, which for now could mean authoring meaningful record lexicons; but down the line: XRPC queries for easy access.

sh.tangled.*

This section is great. I have some nitpicks about a few created and addedAt fields that probably should be createdAt, but otherwise great work guys!

Ah yeah, I have felt the pain of calling it "created". It makes more sense to have "createdAt", "editedAt", "deletedAt", etc.

sh.tangled.repo.issue.*

The overall design of this section is very good I think. The main things here actively making life harder for any third party are the issueId and commentId fields in sh.tangled.repo.issue and sh.tangled.repo.issue.comment respectively. Having the numeric IDs built into the record itself means that any third party (like a third party client, third party AppView or potentially even Tangled itself in the future, depending on how the architecture evolves) wishing to create an issue or a comment needs to have a full and somewhat authoritative view of every issue or comment that has previously been created and/or is in the middle of being created to know the appropriate value for these fields. Otherwise they risk two issues or comments made around the same time having the same ID in their ID field.

I agree with this. I do want users to be able to refer to issues on the tangled appview using short identifiers. As with all "names", the holy trifecta would be globally unique, memorable and resolvable; integers are only unique and memorable, and can only be resolved in the context of our appview. In the atmosphere, however, the shortest globally unique name is an AT URI, which is, given we know the NSID already, the pair consisting of (did, rkey). I think the changes we want to make here are to remove the bespoke IDs for issues and comments from the record, but invent some form of ID to preserve the "ticket" linking mechanism, if only on the appview. Something I can think of right away is to use short issue/PR IDs on the appview side when a record is indexed, but also offer permalinks by using the did+rkey combo.
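As a small sketch of the (did, rkey) idea: an AT URI for a record splits into authority, collection NSID and record key, so once the collection is known the other two parts are enough for a permalink. The parser below is deliberately minimal (it assumes a plain at://authority/collection/rkey form with no query or fragment), and the function and type names are made up for the example. The URI is this issue's own.

```python
from typing import NamedTuple


class AtUri(NamedTuple):
    did: str         # authority; assumed here to be a DID rather than a handle
    collection: str  # NSID of the record type
    rkey: str        # record key


def parse_at_uri(uri: str) -> AtUri:
    """Split a record-level AT URI into its three parts (minimal, non-validating)."""
    prefix = "at://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an AT URI: {uri!r}")
    parts = uri[len(prefix):].split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected at://<did>/<collection>/<rkey>: {uri!r}")
    return AtUri(*parts)


parsed = parse_at_uri("at://did:plc:h5wsnqetncv6lu2weom35lg2/sh.tangled.repo.issue/3ll5cut5gym22")
permalink_key = (parsed.did, parsed.rkey)
```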

Comments not having any kind of reply field could potentially be an issue. A third party consumer would be able to arrange the comments just fine by createdAt (barring the ability to slot comments in between others as mentioned above), but having replies would make crawling the network for relevant records easier for a non-archival use case (and would be resilient to people messing with createdAt). Having proper first party replies would (IMO) be a good feature in general, but that depends on the direction you want to take issue comments, so that's probably more of a design goal decision. If comments end up getting a reply field, it should definitely be a strong ref, and either do like Bluesky's replies and reference both the root and the parent, or only reference the parent.

I agree on this one too! IMO I am still unsure if we want total threading in issues (comments on comments), or if we want this to be a linear thread. In the latter case (which is currently live), I would want to have comments strong-ref to the issue itself. In general, I agree, this would help crawl the network seamlessly.

On a purely good design (and more subjective) note: while the inclusion of an owner field isn't necessarily problematic in any way, it is a bit redundant considering that the record will be in the owner's PDS repo. The required fields of sh.tangled.repo.issue, sh.tangled.repo.issue.comment and sh.tangled.repo.issue.state also all seem to be missing some quite important fields (stuff like body on sh.tangled.repo.issue, state on sh.tangled.repo.issue.state and honestly most fields on sh.tangled.repo.issue.comment). Removing or keeping the owner field probably doesn't matter all that much either way. Specifying a better set of required fields would probably be good though. Required fields also matter for openness: if it is possible and allowed to create records without the fields that enable crawling, third party support (at least technically) goes out the window to a certain degree.

Fully agree, this seems to be a design oversight from the early days. My braincells are rubbing together every now and then to form these thoughts, but they clearly need more time.

sh.tangled.repo.pull.*

Thanks for the detail on this one. I had somehow assumed that getRecord had a way to enumerate CIDs, but it seems that is not possible; it does seem possible to get one record at a given CID. I think I understand the issues presented here, and a few constraints emerge for a redesign:

Some way to allow multiple users to "contribute" to a PR
Distinction between rounds; preferably two records: sh.tangled.repo.pull and sh.tangled.repo.pull.round
Some way to order rounds within a PR, so it is obvious how to rebuild this from a listRecords like output

I think a unique record makes most sense. Alternatively, if we could work in a way to enumerate CIDs into the official APIs, that would be neat.

Like you say, Tangled is in alpha, and we want to use this window to figure things out, so breaking things is not something we are too worried about (we will of course aim to migrate data indexed by the appview thus far). And yes, it definitely is a lot of work, but it's a great value add and I absolutely want to put effort into driving this forward! For other readers that are interested in advancing this: do drop into the IRC channel (#tangled on libera). More than happy to discuss, and potentially mentor the work around this.

nel.pet (author) 11mo ago

Firstly, thanks for taking the time to do this deep dive!

I'm just glad you're so open to all this! I went through this deep dive because I believe in Tangled and want it to be its best, so I'm happy you wanna take on my suggestions.

which for now could mean authoring meaningful record lexicons; but down the line: XRPC queries for easy access

This may or may not be what I see as the first piece in a series of ideas and things to be changed and/or ironed out (including XRPC queries) in order to "atproto-ify" and generally open up the Tangled architecture and design in order to support third party clients and other such niceties ;)

I agree with this. I do want users to be able to refer to issues on the tangled appview using short identifiers. As with all "names", the holy trifecta would be globally unique, memorable and resolvable; integers are only unique and memorable, and can only be resolved in the context of our appview. In the atmosphere, however, the shortest globally unique name is an AT URI, which is, given we know the NSID already, the pair consisting of (did, rkey). I think the changes we want to make here are to remove the bespoke IDs for issues and comments from the record, but invent some form of ID to preserve the "ticket" linking mechanism, if only on the appview. Something I can think of right away is to use short issue/PR IDs on the appview side when a record is indexed, but also offer permalinks by using the did+rkey combo.

I really like that idea! Kind of mirroring the handle <-> DID dichotomy: identifier mappings that are locally unique and resolvable in the context of a Tangled AppView. Like you say, I think the best approach, and the way to keep the pros of both these local IDs and AT URIs, is to keep the IDs as an entirely AppView and client UI level concept, following the approach Bluesky has taken to handling handles. In the sense that the fully "canonicalised" references are used in general for things like URLs, references in records, etc., with allowed usage (outside records) of the local IDs for ease of use. So for example the web UI URL for a given issue would be https://tangled.sh/<owner did>/<repo name>/issues/<issue owner did>/<issue record> where the DIDs can be exchanged for handles and the whole <issue owner did>/<issue record> part can be exchanged for the local identifier. Not necessarily in that format, but that kind of thing. Keeping it completely out of records would also include things like "rewriting" their use in comments and issue bodies and the like before submitting them to PDSs. In a way these local human friendly IDs would be a kind of Tangled side "rendering" of the AT URI.

Format wise I think there's a few constraints it needs to handle. Mainly:

Be short, memorable and descriptive. Each component of the ID should serve a clear function that is easy and intuitive for a human to parse, understand and recreate. ("Short" here meaning that it is as short as possible in its longest form, while also allowing implied parts of the ID to be dropped, for example when referencing an issue in the same repo.)
Support cross repo references and references across "types" (referencing an issue in a PR etc)

Something more or less simple like #[[<owner>/]<repo>/]<issue|pull>/<numeric id> is most likely fine, but it's probably a good idea to see if any ideas for a better format crop up. The <issue|pull> part could be omitted if the numeric IDs for each are combined a la GitHub, but I've seen quite a few people be confused by that in the past and I also think it's cleaner to have the two separate sets of IDs like they are currently.
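To check that the bracketed format is unambiguous to parse, here is a rough sketch using a regular expression. The character classes allowed for owner and repo names are an assumption (anything except slashes, whitespace and #), as are the function and type names.

```python
import re
from typing import NamedTuple, Optional


class LocalRef(NamedTuple):
    owner: Optional[str]
    repo: Optional[str]
    kind: str    # "issue" or "pull"
    number: int


# Mirrors #[[<owner>/]<repo>/]<issue|pull>/<numeric id>
LOCAL_REF = re.compile(
    r"#(?:(?:(?P<owner>[^/\s#]+)/)?(?P<repo>[^/\s#]+)/)?"
    r"(?P<kind>issue|pull)/(?P<number>\d+)"
)


def parse_local_ref(text: str) -> Optional[LocalRef]:
    """Parse one reference; omitted parts come back as None."""
    match = LOCAL_REF.fullmatch(text)
    if match is None:
        return None
    return LocalRef(match["owner"], match["repo"], match["kind"], int(match["number"]))


same_repo = parse_local_ref("#issue/12")
other_repo = parse_local_ref("#core/pull/3")
fully_qualified = parse_local_ref("#tangled.sh/core/issue/29")
```

An AppView or client would then resolve the omitted owner and repo from context before rewriting the reference to an AT URI.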

I agree on this one too! IMO I am still unsure if we want total threading in issues (comments on comments), or if we want this to be a linear thread. In the latter case (which is currently live), I would want to have comments strong-ref to the issue itself. In general, I agree, this would help crawl the network seamlessly.

A strong ref to the issue itself would absolutely be great. I think making it possible to determine the order of comments while enforcing linear threads is going to be hard, since that would necessarily include a strong ref to the "parent" comment. In that case two records can both claim to be the comment following another comment, and now the AppView has to decide which is "correct". At that point you open up for issues where people can purpose-make records that mess with the algorithm that decides which record should be considered "correct". I'm personally all for fully fledged tree style comments, since in my opinion they make it easier to be specific about what you're responding to and easier to keep context across a group of raised concerns. I admit I am a bit biased here though, so it would probably make sense to try and gauge community sentiment.
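A quick sketch of why the tree model sidesteps the "which one is correct" problem: two comments naming the same parent are simply siblings. The input shape (plain dicts with uri, parent and createdAt, where parent is None for top-level comments) is an assumption for the example; real records would carry strong refs rather than bare identifiers.

```python
from collections import defaultdict


def build_thread(comments: list) -> list:
    """Return (depth, uri) pairs in display order; siblings sorted by createdAt."""
    children = defaultdict(list)
    for comment in sorted(comments, key=lambda c: c["createdAt"]):
        children[comment["parent"]].append(comment["uri"])

    def walk(parent, depth):
        rows = []
        for uri in children.get(parent, []):
            rows.append((depth, uri))
            rows.extend(walk(uri, depth + 1))
        return rows

    return walk(None, 0)


# Placeholder identifiers standing in for AT URIs.
comments = [
    {"uri": "c", "parent": "a", "createdAt": "2025-03-01T12:10:00Z"},
    {"uri": "a", "parent": None, "createdAt": "2025-03-01T12:00:00Z"},
    {"uri": "d", "parent": "b", "createdAt": "2025-03-01T12:15:00Z"},
    {"uri": "b", "parent": "a", "createdAt": "2025-03-01T12:05:00Z"},
]
thread = build_thread(comments)
```

Here createdAt only decides the order among siblings, so a forged timestamp can reshuffle neighbours but cannot move a comment to a different place in the tree.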

Thanks for the detail on this one. I had somehow assumed that getRecord had a way to enumerate CIDs, but it seems that is not possible; it does seem possible to get one record at a given CID. I think I understand the issues presented here, and a few constraints emerge for a redesign:

Some way to allow multiple users to "contribute" to a PR

Distinction between rounds; preferably two records: sh.tangled.repo.pull and sh.tangled.repo.pull.round

Some way to order rounds within a PR, so it is obvious how to rebuild this from a listRecords like output

I think a unique record makes most sense. Alternatively, if we could work in a way to enumerate CIDs into the official APIs, that would be neat.

I definitely think dedicated records make the most sense. Not only does using CIDs for this kind of thing go a bit against the point of CIDs being there, but if you want to allow multiple users to submit rounds/submissions to a PR, that will be much cleaner to do with a dedicated record. I love the idea of allowing multiple users to submit on a PR though! I think this also opens up for a discussion about authorisation in Tangled as a whole. It might make sense to have a dedicated set of permission records (or permission records in each "section", or some other third way to handle it) defining whitelists for who is allowed to do what in which scopes (who is allowed to make issues on a repo, who is allowed to comment on said issue, who is allowed to submit to a PR, and even stuff like who's allowed to make repos on a specific knot; all those kinds of things). Probably some implementation of roles too, and wrapping the handling of collaborators into that system as well (a collaborator could in theory just be any person that has been given the permissions to push to a repo and/or manage permissions for a repo).

On the ordering front I think just a strong ref to the previous round should be enough. Theoretically that would be vulnerable to the same kind of issues as described above with enforcing linearity (which I do think makes a lot of sense to stick to here; linearity is part of the core idea of the PR rounds). But I think the concern is lessened here, since only records published by people who have in some way been given permission to publish submissions to the PR would even be considered for the ordering. From there, some simple fallback like first createdAt would avoid messy situations if people accidentally publish more than one submission as a child of a round.
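The ordering rule described here can be sketched as follows: filter to permitted authors, follow the "previous" links from the first round, and break ties by earliest createdAt. The dict shape, the field names and the short identifiers are assumptions for the example; real records would use strong refs (URI plus CID) instead of bare strings.

```python
def order_rounds(rounds: list, allowed_authors: set) -> list:
    """Return round identifiers in order, following `previous` links from the first round."""
    candidates = sorted(
        (r for r in rounds if r["author"] in allowed_authors),
        key=lambda r: r["createdAt"],
    )
    next_by_previous = {}
    for r in candidates:
        # setdefault keeps the earliest round that claims a given predecessor.
        next_by_previous.setdefault(r["previous"], r)

    ordered, seen, cursor = [], set(), None  # the first round has previous=None
    while cursor in next_by_previous:
        current = next_by_previous[cursor]
        if current["uri"] in seen:  # defensive guard against malformed input
            break
        seen.add(current["uri"])
        ordered.append(current["uri"])
        cursor = current["uri"]
    return ordered


alice, bob, mallory = "did:example:alice", "did:example:bob", "did:example:mallory"
rounds = [
    {"uri": "r3", "author": bob, "previous": "r2", "createdAt": "2025-03-04T00:00:00Z"},
    {"uri": "r1", "author": alice, "previous": None, "createdAt": "2025-03-01T00:00:00Z"},
    {"uri": "rx", "author": mallory, "previous": "r1", "createdAt": "2025-03-01T12:00:00Z"},
    {"uri": "r2", "author": alice, "previous": "r1", "createdAt": "2025-03-02T00:00:00Z"},
    {"uri": "r2-dup", "author": alice, "previous": "r1", "createdAt": "2025-03-03T00:00:00Z"},
]
ordered = order_rounds(rounds, allowed_authors={alice, bob})
```

In this sample the round from the unpermitted author is ignored even though it is earlier, and the accidental duplicate loses to the first round that claimed the same predecessor.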

Having the PR and its rounds as separate records also opens the door for stuff like having different types of rounds. Perhaps sh.tangled.repo.pull.patch for PRv1 patches and sh.tangled.repo.pull.submission for PRv2. That could also make it possible to seamlessly "upgrade" a patch to a fully fledged submission if a change set starts to grow too big to be a simple patch. I don't know if that's going to end up being desirable; I'm just throwing ideas out at this point.

I think what makes sense from here is to start drafting up new versions of the simpler lexicons (primarily the sh.tangled.* ones) and look in detail at how to best handle the migration of already indexed records and implementing all that. Lot of stuff. I'm gonna start on drafting the lexicons themselves and see where to go from there once those are ready for review.

Participants 2

Referenced by

#399 API

AT URI

at://did:plc:h5wsnqetncv6lu2weom35lg2/sh.tangled.repo.issue/3ll5cut5gym22