NOMN


NOMN: Microtiming Enhancer

Frequently Asked Questions
What is NOMN actually doing to audio?
NOMN adds the temporal microstructure that natural acoustic sources have and that digital playback doesn't. A violinist's bowing, a vocalist's phrasing, a drummer's microtiming, the mechanical drift of any physical instrument: these produce small, structured variations in event timing that the auditory system has co-evolved with for hundreds of thousands of years. The variations enter the auditory system as part of what neuroscience calls **temporal fine structure**, the sub-millisecond waveform information the cochlea passes to the auditory nerve, which the brain uses for pitch perception, source identification, and the perception of naturalness (see TFS questions below for more resolution).

Digital playback runs on a crystal-locked clock whose timing stability is orders of magnitude tighter than any natural acoustic source. Crystals have measurable phase noise and jitter. We're not claiming they don't, but those deviations are vanishingly small and statistically structureless compared to the rich temporal variation any physical sound source produces. There has never, in the natural history of hearing, been a sound source so temporally rigid.

NOMN puts back the kind of variation that grid-locked playback removed. Not as random noise, not as a recognizable effect, but as structured temporal patterning that the auditory system reads as natural rather than mechanical.
Isn't this just an advanced tremolo or a fancy chorus?
At the DSP-primitive level, timing modulation has been around forever. Tape wow and flutter has been doing this to analog audio since the 1930s. Granular synthesis has been operating at sub-10ms scales since the 1970s. Every chorus plugin since the 1990s has been doing sub-sample-resolution time-varying modulation. We're not claiming to have invented modulation :)

What's new is what's driving it and what it does to the human:machine relation for audio.

A tremolo's control signal is a 2-parameter LFO. A chorus is a 4-6 parameter LFO. A humanizer plugin is filtered random noise. Tape emulation is noise shaped to match measured wow/flutter spectra from vintage gear. All of these are content-blind, and none of them is modeled from the body: they're modeled from affective technological nostalgia.
Music cognition research says the smallest perceptible timing difference is something like 10-50ms. Doesn't that mean NOMN's microsecond-scale modulation is below audibility / "Just Noticeable Difference" (JND) threshold and thus handwavy audiophile nonsense like wildly expensive speaker cable or something?
This is the most common version of the audibility critique, and it gets the question backwards.

First, on what the JND literature actually measures. JND (just-noticeable-difference) thresholds for musical timing, the ones in the 10-50ms range, measure how much one note has to move relative to another before a listener can consciously identify the shift in a forced-choice cognitive task. That tells you when timing becomes *labelable* as different. It does not tell you the resolution at which the auditory system processes time or what we sense.

The auditory system's actual temporal resolution is roughly three to four orders of magnitude finer than musical JND. The two most established lines of evidence:

The binaural pathway resolves interaural time differences down to about 10 microseconds. Klumpp & Eady (1956, J. Acoust. Soc. Am. 28: 859-860) measured average ITD discrimination thresholds of 9μs for band-limited noise and 11μs for a 1000-Hz tone across ten listeners. These thresholds have been independently reproduced for nearly seventy years. Brughera, Dunai & Hartmann (2013, J. Acoust. Soc. Am. 133: 2839-2855) confirmed thresholds just above 10μs at 700-1000 Hz using modern methods. The lowest measured thresholds approach the single-microsecond range under optimal conditions. The mechanism is well-understood: neurons in the medial superior olive perform coincidence detection on phase-locked spikes from each ear. The largest ITD anyone normally encounters, for a sound directly to one side, is around 600-700μs, set by the distance between the ears (Mills 1958, J. Acoust. Soc. Am. 30: 237-246). Listeners reliably resolve angular differences of about 1 degree near the midline. Note that most of this research is already 70+ years old!
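
A worked check on those numbers, as a rough sketch: Woodworth's spherical-head approximation gives the ITD of a distant source as (r/c)(θ + sin θ). The head radius of 8.75 cm and the speed of sound of 343 m/s are assumed typical values, not figures taken from the studies cited above.

```python
import math

HEAD_RADIUS_M = 0.0875       # assumed average adult head radius
SPEED_OF_SOUND_M_S = 343.0   # assumed: dry air at about 20 degrees C


def woodworth_itd_us(azimuth_deg: float) -> float:
    """ITD in microseconds for a distant source (spherical-head model, 0-90 degrees)."""
    theta = math.radians(azimuth_deg)
    return HEAD_RADIUS_M / SPEED_OF_SOUND_M_S * (theta + math.sin(theta)) * 1e6


itd_side = woodworth_itd_us(90.0)        # source directly to one side
itd_one_degree = woodworth_itd_us(1.0)   # source 1 degree off the midline

print(f"ITD at 90 degrees: {itd_side:.0f} us")
print(f"ITD at 1 degree:   {itd_one_degree:.1f} us")
```

Under these assumptions a source directly to one side gives an ITD of roughly 656μs, and a 1-degree offset from the midline gives roughly 9μs. In other words, a ~10μs ITD threshold and ~1-degree angular acuity are two descriptions of the same limit.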

The monaural pathway encodes the sub-millisecond structure of sounds through what auditory neuroscience calls **temporal fine structure (TFS)**, the rapid waveform oscillations within each cochlear frequency band, as distinct from the slower envelope (ENV) modulations superimposed on them (Moore 2008, J. Assoc. Res. Otolaryngol. 9: 399-406, the canonical review). TFS information is carried in the timing of auditory-nerve-fiber spikes that phase-lock to individual cycles of the stimulus waveform for low-frequency components up to several kilohertz. This isn't a hypothesis or a contested claim, it is the standard model of how the auditory periphery encodes time, reviewed comprehensively in Joris, Schreiner & Rees (2004, Physiological Reviews 84: 541-577).

TFS is what the auditory system uses for pitch perception of complex tones, for the perception of speech in fluctuating background noise, and for source separation in complex acoustic environments. Smith, Delgutte & Oxenham (2002, Nature 416: 87-90) demonstrated this directly by constructing "chimaeric" sounds in which the envelope of one signal was combined with the TFS of another. Listeners reliably perceived pitch and source location based on the TFS, not the envelope. TFS isn't specific to live sound, binaural listening, or any particular playback situation. It operates on whatever the cochlea receives, including the output of headphones and speakers playing recorded music. When you listen to a recording, the temporal fine structure of the audio is encoded into the spike timing of your auditory nerve at sub-millisecond resolution. This processing happens continuously, below the threshold of conscious awareness, which is exactly why musical JND studies don't measure it. JND measures what listeners can report. It doesn't measure what their auditory systems are doing.

The more important point. **The right question isn't whether listeners can A/B-distinguish two audio files in a controlled trial. The right question is whether the technology that generates audio for human consumption should operate at the resolution of the sensory system it's serving.**

The audio industry has answered this question consistently for decades. Studios record at 96kHz or 192kHz not because listeners can reliably A/B-distinguish those rates from 48kHz on every track, but because the production chain shouldn't have artifacts introduced at the resolution end of the system. Mastering engineers obsess over jitter specifications in word clocks that operate well below classical audibility thresholds, because they don't want the clock to be the bottleneck. Professional audio interfaces compete on sub-millisecond round-trip latency. The principle is consistent: human-facing audio technology should operate above the sensory floor, not below it.

NOMN sits in this lineage. Crystal-locked playback timing is acoustically unprecedented in the natural history of hearing. There has never been a sound source with this little temporal variation. The question isn't whether listeners can articulate the difference in a forced-choice test on a per-track basis. The question is whether AI-generated audio at scale, intended for billions of hours of human listening, should match the temporal resolution the sensory system actually uses. We think it should. The audio industry has historically agreed with that principle for every other dimension of the playback chain: sample rate, bit depth, jitter, latency, frequency response, distortion. Treating the temporal microstructure dimension as the lone exception, just because the relevant variation sits below conscious labelling threshold, is inconsistent.

If the audibility critique held, if anything below conscious JND were perceptually irrelevant, listeners couldn't localize sound sources, couldn't separate voices in a crowd, couldn't tell a real violin from a sampled violin played through the same speaker. All of those judgments depend on temporal resolution far finer than musical JND.
OK so this is all pretty interesting, but what's temporal fine structure, exactly, and where does NOMN sit relative to the established TFS literature?
Temporal fine structure (TFS) is the standard technical term in auditory neuroscience for the rapid sub-millisecond waveform information the cochlea passes to the auditory nerve, as distinct from the slower envelope (ENV) information riding on top of it. The cochlea decomposes broadband sound into narrowband signals via auditory filtering, and each of those narrowband signals can be characterized as a slow-varying envelope superimposed on a faster carrier: the fine structure. Both kinds of information are encoded in the timing of auditory-nerve spikes, but they're carried differently. ENV through changes in firing rate, TFS through phase-locking to individual cycles of the waveform.
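
As an illustration of that envelope/carrier split, here is a minimal sketch of the Hilbert decomposition applied to one synthetic narrowband signal. It is a textbook construction, not NOMN's processing, and it skips the cochlear filterbank stage entirely. The signal length and the carrier and envelope rates are arbitrary choices made for the demo.

```python
import cmath
import math


def analytic_signal(x):
    """Analytic signal via a direct O(N^2) DFT (fine for short demo signals)."""
    n = len(x)
    spectrum = [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
                for k in range(n)]
    # Keep DC and Nyquist, double the positive frequencies, zero the negative ones.
    for k in range(n):
        if 0 < k < n / 2:
            spectrum[k] *= 2
        elif k > n / 2:
            spectrum[k] = 0
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


N = 256
CARRIER_CYCLES = 32    # fast oscillation: the fine structure
ENVELOPE_CYCLES = 2    # slow modulation: the envelope

true_env = [1.0 + 0.5 * math.cos(2 * math.pi * ENVELOPE_CYCLES * t / N) for t in range(N)]
x = [true_env[t] * math.cos(2 * math.pi * CARRIER_CYCLES * t / N) for t in range(N)]

z = analytic_signal(x)
env = [abs(v) for v in z]                     # ENV: magnitude of the analytic signal
tfs = [math.cos(cmath.phase(v)) for v in z]   # TFS: unit-amplitude carrier

max_env_error = max(abs(e - te) for e, te in zip(env, true_env))
max_resynthesis_error = max(abs(e * f - s) for e, f, s in zip(env, tfs, x))
print(max_env_error, max_resynthesis_error)
```

For a signal like this one, where the envelope varies much more slowly than the carrier, the magnitude of the analytic signal recovers the envelope, the cosine of its phase is the fine structure, and multiplying the two returns the original signal.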

The TFS framework has been extensively developed in the auditory science literature over the past two decades. Moore (2008, J. Assoc. Res. Otolaryngol. 9: 399-406) is the standard review of TFS's role in pitch perception, masking, and speech perception. Smith, Delgutte & Oxenham (2002, Nature 416: 87-90) used "chimaeric" sounds, constructed by combining the envelope of one signal with the TFS of another, to demonstrate that listeners rely on TFS for pitch and source localization while relying on ENV for speech recognition in quiet. Subsequent work (Lorenzi et al. 2006, PNAS 103: 18866-18869; Hopkins & Moore 2009, J. Acoust. Soc. Am. 125: 442-446) has shown that TFS sensitivity is critical for speech perception in noisy environments, and that hearing-impaired listeners' reduced sensitivity to TFS is a major factor in their difficulty understanding speech in noise.

This matters for NOMN in two ways.

First, TFS is the established technical vocabulary for what NOMN operates on. The temporal microstructure NOMN restores to digital playback is, in the technical language of the field, modulation in the temporal fine structure of the audio signal. We aren't making up a new perceptual category. We're operating in a well-mapped region of the auditory science literature.

Second, the existing TFS research focuses primarily on what's *lost*. How hearing-impaired listeners lose TFS sensitivity, how cochlear implants struggle to deliver TFS information, how aging degrades TFS processing. NOMN approaches the question from the other direction: what kind of TFS structure should well-engineered playback technology preserve and present to listeners whose TFS processing is intact? The auditory science community has spent two decades documenting how much TFS matters for normal hearing. The audio industry has not yet drawn the corresponding conclusion about playback technology design. NOMN is one application of that conclusion.

A note on scope. The "fine structure" in TFS refers to the rapid carrier oscillation within auditory filter bands, which is encoded at sub-millisecond resolution via phase-locking up to several kilohertz. NOMN's modulation operates across a range from microsecond to millisecond scales, modulating the temporal structure of the audio content itself. Both sit in the temporal regime where the auditory system does fine-grained timing work. We use the broader phrase "temporal microstructure" in marketing copy to avoid claiming we directly manipulate the specific signal-processing quantity that TFS researchers technically measure with the Hilbert decomposition, but the perceptual mechanism we're targeting is the same one that TFS research has been documenting since the early 2000s.
If sub-JND timing differences don't matter, why does the audio industry spend so much effort minimizing latency?
Because they do matter, and this is the cleanest illustration of the human:machine framing we just made.

Every working musician who records with a DAW tunes their audio buffer size to keep round-trip latency as low as possible. Professional audio interfaces compete on sub-millisecond round-trip latency. The Bela platform was specifically built to achieve sub-millisecond action-to-sound latency for digital musical instruments (McPherson, Jack & Moro 2016, Proc. NIME) because most common platforms fail to meet the targets professional musicians need.

The peer-reviewed evidence on what musicians actually feel is clear. Jack, Mehrabi, Stockman & McPherson (2018, Music Perception 36: 109-128) tested professional percussionists and amateur musicians on a digital percussion instrument with controlled latency conditions of 0ms, 10ms, 10ms ± 3ms jitter, and 20ms. Both groups rated zero-latency as significantly higher quality than the 10ms-with-jitter and 20ms conditions. Professional percussionists were more sensitive to latency than amateurs and showed measurable changes in timing performance under added latency. Schmid et al. (2024, Proc. Mensch und Computer, ACM) measured the just-noticeable difference for added audio latency across 37 listeners and found a mean JND of 27ms at 64ms base latency, with musically sophisticated participants reliably detecting smaller margins. Earlier ensemble work documented that asynchronies up to 50ms occur in real performances (Rasch 1979, Acustica 43: 121-131) and that professional percussionists exhibit timing jitter of 10-40ms even when synchronizing to a metronome (Dahl 2011, Music Perception 28: 491-503).

Acoustic drums have a natural latency of about 2-3ms from stick contact to sound reaching the drummer's ears, a value set by the speed of sound across the distance from the drum to the head. This is the baseline the drummer's nervous system has calibrated to over years of practice. When an electronic drum module introduces an extra 5-10ms on top of this, professional drummers describe the kit as "sluggish," "disconnected," "laggy."
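
The 2-3ms figure can be checked with simple arithmetic. This sketch assumes a speed of sound of 343 m/s and drumhead-to-ear distances of 0.7 m and 1.0 m. Both distances are illustrative guesses at a seated playing position, not measurements.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed: dry air at about 20 degrees C


def acoustic_delay_ms(distance_m: float) -> float:
    """Time for sound to travel distance_m through air, in milliseconds."""
    return distance_m / SPEED_OF_SOUND_M_S * 1000.0


for distance_m in (0.7, 1.0):  # assumed drumhead-to-ear distances
    print(f"{distance_m:.1f} m -> {acoustic_delay_ms(distance_m):.1f} ms")
```

Those distances give about 2.0ms and 2.9ms, which brackets the range quoted above.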

Notice what's happening here. The audio industry has, for decades, accepted the principle that **playback technology should operate at the temporal resolution the sensory system actually uses, not at the resolution of conscious A/B detection**. Nobody argues that audio interfaces should target 50ms latency because that's the conscious JND. The industry targets sub-millisecond because that's where the human:machine interaction breaks down. Studios record at high sample rates so that the production chain isn't the bottleneck. Word clocks are spec'd at jitter levels below classical audibility for the same reason. You don't want the clock to be the lowest-resolution element in the system.

This is exactly the principle NOMN applies. Crystal-locked playback has temporal stability orders of magnitude tighter than any natural acoustic source. The sensory system that consumes the audio resolves timing at microsecond scales. The fact that listeners can't always consciously label what they're hearing in an A/B test doesn't mean the technology should operate below the sensory floor. It means the audio industry should treat temporal microstructure with the same engineering discipline it already applies to sample rate, bit depth, latency, and jitter.
But the speaker cone and the room introduce way more temporal modification than NOMN does. Doesn't that swamp the effect?
In absolute time-magnitude terms, yes. A room's impulse response operates at the millisecond-to-hundreds-of-milliseconds scale. Speaker cone breakup happens at sub-millisecond scales. The acoustic chain introduces more temporal modification than NOMN does.

The relevant difference isn't magnitude. It's structure.

Room and speaker convolution is content-blind and stationary. The room's impulse response is fixed for a given listening position. The reverb tail of a snare hit and the reverb tail of a sustained vocal note get the same room treatment. This is convolution with a fixed kernel, large in magnitude, but content-blind and time-invariant.

The auditory system has well-documented machinery for separating direct-path source signals from reverberant reflections. The foundational finding is the precedence effect, first systematically described by Wallach, Newman & Rosenzweig (1949, American Journal of Psychology 62: 315-336). When two identical sounds arrive at the ears within a few milliseconds of each other, the listener perceives a single fused sound localized at the position of the first-arriving wavefront, with the later-arriving reflections strongly suppressed in their contribution to perceived location. This is why you can localize a speaker in a reverberant room. The brain attributes the spatial cue to the direct sound and treats the reflections as environment. The mechanism extends into the broader framework of auditory scene analysis (Bregman, 1990, MIT Press), in which the auditory system uses primitive grouping cues to organize incoming sound into source representations distinct from environmental context. Subsequent reviews (Litovsky et al. 1999, J. Acoust. Soc. Am. 106: 1633-1654; Brown et al. 2015, J. Acoust. Soc. Am. 137: 776-790) document this is a continuous, automatic process operating below conscious awareness.

What the auditory system *can't* factor out, and uses heavily for source identification and naturalness judgment, is the underlying source's intrinsic timing structure. The room can smear what's there. It can't add what isn't, and it can't subtract what is.

Put simply: a real violin and a sampled violin played through the same speaker in the same room are typically distinguished by listeners on extended listening. The acoustic chain is identical. The difference is in source-level temporal structure that survives the chain because it's encoded in the signal before it ever reaches the speaker.
Doesn't the DAC's reconstruction filter smooth out fast timing modulation anyway?
No, and the reason matters for what we are doing here: NOMN's modulation isn't a separate timing channel for the DAC to filter out. The modulation is encoded in the audio content itself, in which samples contain what energy. The DAC sees a normal audio signal at its native sample rate and applies its usual reconstruction. Whatever filtering the DAC does to the audio, it does identically to NOMN-processed audio and unprocessed audio. The modulation is preserved because it's a property of the content, not metadata that the filter could destroy.

A general principle worth stating clearly: NOMN's modulation is content, not metadata. Anything that processes the audio processes the modulation along with it. Anything that doesn't process the audio can't touch the modulation. There's no separate channel to attack. The same logic applies to the speaker, the room, the listener's HRTF, the ear canal. All linear time-invariant operations applied to the audio content, none of which selectively erase the modulation.
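
To make "content, not metadata" concrete, here is a rough sketch of the generic DSP primitive involved: a time-varying fractional delay implemented with linear interpolation. This is the same building block a chorus uses. The sinusoidal delay trajectory, the 5μs depth, and the 3 Hz rate are placeholders for illustration only; this is not NOMN's algorithm or its modulation source.

```python
import math

SAMPLE_RATE = 48_000
TONE_HZ = 1_000.0
MAX_DELAY_US = 5.0    # assumed modulation depth, illustration only
SLOW_RATE_HZ = 3.0    # assumed rate of the delay trajectory

n_samples = SAMPLE_RATE  # one second of audio
x = [math.sin(2 * math.pi * TONE_HZ * n / SAMPLE_RATE) for n in range(n_samples)]


def read_fractional(signal, position):
    """Linear interpolation between neighboring samples; clamps at the edges."""
    position = min(max(position, 0.0), len(signal) - 1.0)
    i = int(position)
    frac = position - i
    nxt = min(i + 1, len(signal) - 1)
    return (1.0 - frac) * signal[i] + frac * signal[nxt]


max_delay_samples = MAX_DELAY_US * 1e-6 * SAMPLE_RATE  # 0.24 of one sample
y = []
for n in range(n_samples):
    # Delay depth sweeps smoothly between 0 and 1.
    depth = 0.5 * (1.0 - math.cos(2 * math.pi * SLOW_RATE_HZ * n / SAMPLE_RATE))
    y.append(read_fractional(x, n - depth * max_delay_samples))

max_change = max(abs(a - b) for a, b in zip(x, y))
print(f"largest change in any sample value: {max_change:.4f}")
```

The output is an ordinary list of PCM sample values at the original sample rate. The timing variation exists only as small changes in those values, so any downstream stage that passes the audio passes the variation along with it.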
Couldn't you accomplish the same thing with a low-depth chorus or some filtered noise driving varispeed?
You could accomplish *some* of it. Audio engineers have known for decades that adding subtle temporal variation makes digital audio sound less mechanical. Tape emulation plugins, subtle chorus, and pitch modulation are all in this lineage. We're not denying that any structured temporal variation is better than none.

The difference is in what the auditory system does with different kinds of variation. LFO-driven modulation is periodic, and the auditory system detects periodicity below conscious awareness. Subtle periodic modulation reads as "wobbly" or "effected" even when listeners can't say why. Filtered noise modulation is aperiodic but content-blind, which the auditory system also reads as foreign to natural sources, since natural sources don't produce statistically white timing variation. Natural timing variation has specific structure: long-range correlations and content correlation that have been measured directly in human performance. Hennig (2014, PNAS 111: 12974-12979) documented that timing deviations in professional drum performances exhibit long-range (1/f-type) correlations rather than white-noise statistics, a finding consistent with broader work on temporal structure in human motor performance (Gilden, Thornton & Mallon 1995, Science 267: 1837-1839). The closer your modulation matches this structure, the less the auditory system flags it as alien.

NOMN's modulation matches that structure. A low-depth chorus doesn't, and neither does content-blind noise, even 1/f-shaped noise, because it lacks the content correlation.
Hasn't this been tried before? Isn't NOMN just like MQA or C Wave or one of those audiophile dead ends?
Our intentions are quite different, and our claims are far less extreme. MQA tried to fix time-domain artifacts in the encoding/decoding chain itself, marketed lossy compression as lossless, required proprietary decoders, and treated independent measurement as an adversary. It collapsed under sustained technical critique. NOMN doesn't touch the encoding chain. We add temporal microstructure at playback, downstream of reconstruction, with conventional architecture. We think it would be great if NOMN became integrated into hardware and streaming clients, and if some mastering engineers found it compelling enough to use as a final touch.

C Wave argues that PCM is "non-continuous" and that the brain detects this discontinuity. Their solution is kinda reverb to "fill in gaps." We don't share that diagnosis. A reverb algorithm running on PCM is still PCM, and Shannon-Nyquist guarantees that properly bandlimited PCM is mathematically equivalent to a continuous waveform up to the Nyquist frequency. There are no gaps to fill in the digital signal. We're not claiming to fix something inside PCM. We're claiming that natural acoustic sources have temporal microstructure that crystal-locked playback lacks, which is a different claim, one grounded in the physical properties of natural sound sources rather than in disputed claims about sampling theory.
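
The sampling-theory point can be checked numerically. The sketch below evaluates the Whittaker-Shannon interpolation formula at a position between two samples of a 1 kHz tone sampled at 48 kHz and compares the result with the continuous sine at that instant. The tone, the position, and the truncation of the ideally infinite sum to 2,000 samples on each side are choices made for this demo.

```python
import math

SAMPLE_RATE = 48_000
TONE_HZ = 1_000.0    # well below the 24 kHz Nyquist frequency
HALF_WIDTH = 2_000   # samples used on each side; the ideal sum is infinite


def sample(n: int) -> float:
    """The n-th PCM sample of the tone."""
    return math.sin(2 * math.pi * TONE_HZ * n / SAMPLE_RATE)


def sinc(u: float) -> float:
    return 1.0 if u == 0.0 else math.sin(math.pi * u) / (math.pi * u)


def reconstruct(position: float) -> float:
    """Whittaker-Shannon interpolation at a fractional sample position."""
    center = round(position)
    return sum(sample(n) * sinc(position - n)
               for n in range(center - HALF_WIDTH, center + HALF_WIDTH + 1))


position = 10.37  # a point between two samples
exact = math.sin(2 * math.pi * TONE_HZ * position / SAMPLE_RATE)
error = abs(reconstruct(position) - exact)
print(f"reconstruction error between samples: {error:.2e}")
```

The residual `error` comes only from truncating the sum and shrinks as `HALF_WIDTH` grows. The samples alone determine the waveform between them.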

The single biggest lesson from those efforts: don't pick fights with sampling theory, don't claim what you can't measure, and don't treat independent measurement as an enemy.
How is this different from a humanizer plugin?
Humanizer plugins use random number generators to add timing variation to MIDI events. They've existed since the early 1990s, and they help. That's why every DAW has one.

Two differences. First, humanizers add stochastic variation. NOMN adds structured variation matched to natural source statistics. Random isn't the same as natural. The long-range correlation structure documented in human motor timing (Gilden et al. 1995; Hennig 2014) is categorically different from the white-noise distribution most humanizers produce, and the auditory system responds to that distinction.

Second, humanizers operate on MIDI event timing before audio rendering. NOMN operates on audio at the signal level. A humanizer on a quantized MIDI snare moves the hit. NOMN modulates the playback of the audio itself. Different operations, different signal-chain positions, different effects. A humanizer can't humanize a finished audio file. NOMN can.
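
The statistical distinction in the first point can be illustrated with a toy comparison that assumes nothing about NOMN's own generator. The sketch draws one sequence of independent (white) timing deviations and one sequence from the Voss-McCartney approximation of 1/f noise, then compares how strongly each deviation predicts the next. The sequence length, the seed, and the number of rows are arbitrary.

```python
import random


def lag1_autocorrelation(values):
    """Correlation between each value and the one that follows it."""
    mean = sum(values) / len(values)
    centered = [v - mean for v in values]
    num = sum(a * b for a, b in zip(centered, centered[1:]))
    den = sum(c * c for c in centered)
    return num / den


def white_deviations(count, rng):
    """Independent draws: every deviation is unrelated to the previous one."""
    return [rng.gauss(0.0, 1.0) for _ in range(count)]


def pink_deviations(count, rng, rows=8):
    """Voss-McCartney approximation of 1/f noise: row i is redrawn every 2**i steps."""
    state = [rng.gauss(0.0, 1.0) for _ in range(rows)]
    out = []
    for n in range(count):
        for i in range(rows):
            if n % (2 ** i) == 0:
                state[i] = rng.gauss(0.0, 1.0)
        out.append(sum(state))
    return out


rng = random.Random(0)  # fixed seed so the comparison is repeatable
COUNT = 4096
white_r1 = lag1_autocorrelation(white_deviations(COUNT, rng))
pink_r1 = lag1_autocorrelation(pink_deviations(COUNT, rng))
print(f"white lag-1 autocorrelation:    {white_r1:+.2f}")
print(f"1/f-type lag-1 autocorrelation: {pink_r1:+.2f}")
```

White deviations have a lag-1 autocorrelation near zero, while the 1/f-type sequence stays strongly correlated from one event to the next. This shows only the correlation structure. It does not model the content dependence of natural timing.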
Is the temporal modulation audible?
This is the wrong question, and the way it's usually asked is part of why the audibility debate in audio has been unproductive for so long.

If you mean "can a listener identify NOMN as a recognizable effect," generally no, and that's the design intent. A flanger that wasn't audible would be failing at its purpose. NOMN that was audible as processing would be failing at its purpose. They're aiming at opposite outcomes.

If you mean "would a listener succeed at A/B-distinguishing NOMN-processed audio from unprocessed audio in a controlled trial," that's an empirical question we'd love to investigate with proper perceptual research, and when we can fund that study and publish the results, we will. It's also not the question that decides whether the technology matters or is worth pursuing or supporting.

The relevant question is the one the audio industry has been answering for decades on every other dimension of the playback chain: does the technology operate at the temporal resolution the sensory system actually uses? For sample rate, bit depth, latency, jitter, and frequency response, the industry has consistently answered yes. The production chain should match the sensory floor, not the conscious A/B detection threshold. We're applying the same engineering discipline to temporal microstructure. Whether a listener can articulate the difference in a forced-choice test on a per-track basis is a different question from whether the technology serving billions of hours of human listening should match the sensory resolution.
Why is it called NOMN? Is the last N a silent N?
The name derives from "metronome." It also reads as "no man." We treat that dual resonance, between mechanical timekeeping and human-produced variation, as productive rather than something to resolve.
Where can I read more about the auditory science you're citing?
The claims in this FAQ about how the auditory system processes time aren't ours. They're standard neuroscience, and we've cited the canonical sources so anyone can verify what we're working from. The full set:

INTERAURAL TIME DIFFERENCE THRESHOLDS

— Klumpp, R.G. & Eady, H.R. (1956). "Some Measurements of Interaural Time Difference Thresholds." Journal of the Acoustical Society of America 28(5): 859-860. The original measurement: 9μs threshold for band-limited noise, 11μs for 1000-Hz tone, 28μs for clicks (75% correct discrimination, ten listeners).

— Mills, A.W. (1958). "On the Minimum Audible Angle." Journal of the Acoustical Society of America 30(4): 237-246. Foundational measurement of angular acuity in sound localization (~1° near midline).

— Brughera, A., Dunai, L. & Hartmann, W.M. (2013). "Human interaural time difference thresholds for sine tones: The high-frequency limit." Journal of the Acoustical Society of America 133(5): 2839-2855. Modern confirmation of ~10μs thresholds for pure tones at mid-frequencies, with high-frequency cutoff around 1.4 kHz.

NEURAL CODING OF TEMPORAL STRUCTURE

— Joris, P.X., Schreiner, C.E. & Rees, A. (2004). "Neural Processing of Amplitude-Modulated Sounds." Physiological Reviews 84(2): 541-577. The standard review on how the auditory system encodes temporal modulation for source localization, identification, and parsing.

— Moore, B.C.J. (2008). "The role of temporal fine structure processing in pitch perception, masking, and speech perception for normal-hearing and hearing-impaired people." Journal of the Association for Research in Otolaryngology 9(4): 399-406. The canonical review of temporal fine structure (TFS) and its perceptual role.

— Smith, Z.M., Delgutte, B. & Oxenham, A.J. (2002). "Chimaeric sounds reveal dichotomies in auditory perception." Nature 416: 87-90. The foundational experimental demonstration that listeners rely on TFS for pitch and localization while ENV dominates speech recognition in quiet.

— Lorenzi, C., Gilbert, G., Carn, H., Garnier, S. & Moore, B.C.J. (2006). "Speech perception problems of the hearing impaired reflect inability to use temporal fine structure." Proceedings of the National Academy of Sciences 103: 18866-18869. Direct evidence for TFS's role in speech-in-noise perception.

SOURCE/ENVIRONMENT SEPARATION

— Wallach, H., Newman, E.B. & Rosenzweig, M.R. (1949). "The Precedence Effect in Sound Localization." American Journal of Psychology 62(3): 315-336. The foundational paper showing that listeners localize sounds based on first-arriving wavefront, suppressing reverberant reflections.

— Bregman, A.S. (1990). Auditory Scene Analysis: The Perceptual Organization of Sound. MIT Press. The standard reference text on how the auditory system organizes complex sound mixtures into source representations.

— Litovsky, R.Y., Colburn, H.S., Yost, W.A. & Guzman, S.J. (1999). "The Precedence Effect." Journal of the Acoustical Society of America 106(4): 1633-1654. Comprehensive review of the precedence effect and echo suppression literature.

LATENCY PERCEPTION AND MUSICAL PERFORMANCE

— Jack, R.H., Mehrabi, A., Stockman, T. & McPherson, A. (2018). "Action-sound Latency and the Perceived Quality of Digital Musical Instruments." Music Perception 36(1): 109-128. Professional percussionists rated 10ms±3ms jitter and 20ms latency conditions as significantly lower quality than zero latency.

— McPherson, A., Jack, R. & Moro, G. (2016). "Action-Sound Latency: Are Our Tools Fast Enough?" Proc. NIME 2016. Survey demonstrating most digital musical instrument platforms fail to meet sub-millisecond latency targets; motivates the Bela platform.

— Schmid, A., et al. (2024). "Measuring the Just Noticeable Difference for Audio Latency." Proc. Mensch und Computer 2024 (ACM). Mean JND of 27ms at 64ms base latency, with musically sophisticated listeners detecting smaller margins.

— Dahl, S. (2011). "Striking Movements: A Survey of Motion Analysis of Percussionists." Music Perception 28(5): 491-503. Documentation of percussionist timing variability.

NATURAL TIMING STATISTICS

— Hennig, H. (2014). "Synchronization in human musical rhythms and mutually interacting complex systems." Proceedings of the National Academy of Sciences 111(36): 12974-12979. Direct measurement of 1/f long-range correlations in professional drum performance timing.

— Gilden, D.L., Thornton, T. & Mallon, M.W. (1995). "1/f noise in human cognition." Science 267: 1837-1839. Broader finding of 1/f temporal structure across human cognitive and motor performance.

We cite this work because we want NOMN's perceptual claims to rest on the same foundation as the rest of the auditory science community's. Independent measurement and verification are how this field moves forward, and we don't want to be exempt from that.

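Hearing is the fastest human sense, by a margin of more than 10x. The human ear can detect timing differences of 10 microseconds. If the display you are reading this on refreshes at 60 Hz, that is more than 1,500 times slower than what your ears can resolve.

Every digital audio source on Earth shares one property: near-mathematically perfect timing. DAWs, digital synthesizers, drum machines, samplers, streaming audio: all of them are temporally rigid by design. Audiophiles use 10 MHz external clocks in pursuit of maximum stability. "Fidelity" has always been defined as zero frequency instability and zero timing variation.

Meanwhile, the industry has spent fifty years optimizing spectral fidelity, building a digital infrastructure for making and listening to music that operates at a precision orders of magnitude below the temporal sensitivity of the system it is supposed to serve: the listener.

Sound in nature is never temporally perfect. Every acoustic instrument, every voice, every breath of wind moving through an environment exhibits continuous microsecond-scale timing variation that originates in the physical process producing it. These variations are not defects. They are exactly what the auditory system recognizes as "alive." The cornerstone sub-technology of all audio technology is an underlying periodicity: the clock. Whether it is a modulated electrical frequency, a rotating wax cylinder, a record-cutting lathe, or a digital-to-analog converter, there is always some means of quantizing, and of maintaining the logical structure of the newly generated quanta throughout the system. If the clock degrades, the illusion collapses: like a flipbook flipped too slowly, the perceptual hack fails.

Turntables and analog tape machines don't sound better. They feel better. They are microtiming enhancers. The mechanical instability of a platter or tape transport introduces variation in the time domain that is coupled to frequency instability. This is the quality people spend fortunes chasing through vinyl, vacuum tubes, and analog signal chains, often without being able to say what they are hearing, because what they are hearing is not spectral. It is temporal.

NOMN restores temporal vitality to digital audio. It is a microtiming enhancement system that introduces human-structured, non-repeating timing variation into any audio stream, at the resolution of the human perceptual system.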

--
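## How It Works

NOMN is trained on the temporal microstructure of human speech in more than 80 languages. Not phonemes, not words, not meaning, not timbre. Only the microtiming patterns that make biological communication feel "alive." Patterns from diverse linguistic traditions are distilled into a generative model of organic temporal behavior.

At runtime, the system produces a continuous stream of timing variations, at more than 1,000 updates per second, and applies it to the incoming audio. The original content is fully preserved. Nothing is added to or removed from the signal. Only the temporal microstructure is enriched, at a resolution below the threshold of swing or groove but within the threshold of perceptual effect.

These variations are not random and cannot be replicated with jitter. They are not periodic. They do not loop. They are contextually structured and non-repeating, generated in real time for every moment of audio that passes through.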

--
## API

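For its initial release, NOMN is offered as a cloud processing service. Send audio, receive temporally enhanced audio.

The API accepts audio in standard formats and returns processed output. Control parameters are optional. When provided, they allow navigation of the system's internal space of timing behaviors. When omitted, the system automatically determines the best enhancement for the input material, adapting in real time to maximize perceptual effect while remaining fully transparent.

Processing runs at high sample rates and sub-millisecond temporal resolution. Latency depends on configuration and suits mastering, post-production, and batch workflows. Near-real-time configurations are available for streaming applications.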
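
As a rough idea of what calling such a service could look like, here is a minimal sketch that builds an HTTP POST carrying a WAV file, with no control parameters so that automatic mode would apply. Everything specific in it is a placeholder: the endpoint URL, the bearer-token authentication, and the `audio/wav` upload convention are assumptions for illustration and are not documented here. The request is only constructed, not sent.

```python
import urllib.request

# HYPOTHETICAL placeholders: the endpoint, the authentication scheme, and the
# upload convention are not documented in this page.
ENDPOINT_URL = "https://nomn.example.invalid/v1/process"
API_KEY = "YOUR_API_KEY"


def build_request(wav_bytes: bytes, api_key: str = API_KEY) -> urllib.request.Request:
    """Build (but do not send) a POST carrying WAV audio; automatic mode assumed."""
    return urllib.request.Request(
        ENDPOINT_URL,
        data=wav_bytes,
        method="POST",
        headers={
            "Content-Type": "audio/wav",            # assumed upload format
            "Authorization": f"Bearer {api_key}",   # assumed auth scheme
        },
    )


request = build_request(b"RIFF....WAVE")  # stand-in bytes, not a valid WAV file

# Sending would look like this once the real values are known:
# with urllib.request.urlopen(request) as response:
#     processed_wav = response.read()
```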

--
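## Use Cases

Mastering and Post-Production
A new dimension of audio enhancement, orthogonal to EQ, compression, spatial processing, and loudness. Applicable to any master, any genre, any recording era.

Streaming and Playback
Deployable as a real-time processing layer in streaming infrastructure or playback devices. Enhances all audio passing through, including music, podcasts, and film audio, without modifying the content.

Hardware Integration
The system's computational footprint is small enough for embedded deployment on audio DSP chips: small enough to fit in headphones, car head units, and portable players. Available for licensed integration into consumer audio hardware, automotive audio systems, and professional equipment.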

--
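## What NOMN Is Not

NOMN is not an equalizer, compressor, spatial processor, or effects unit. It does not modify frequency content, dynamic range, stereo image, or loudness. It does not add harmonics, noise, or saturation.

It operates on a dimension of audio that no existing tool touches: the temporal microstructure that lets audio work as a perceptual hack.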

--
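## Technical Notes

NOMN's timing variations operate at the microsecond scale, the same order of magnitude as the timing instability of analog playback systems, but structured rather than mechanical and non-repeating rather than periodic.

The system includes continuous quality verification that monitors the relationship between intended and rendered timing, ensuring the enhancement is preserved across the full signal chain from processing to output. Null-test analysis confirms that the enhancement is spectrally transparent: the only measurable difference between input and output is in the time domain.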
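
For readers unfamiliar with null testing, here is a minimal sketch of the generic technique: subtract the reference from the processed signal and report the level of what remains. The test tone and the 5μs shift standing in for "processed" audio are assumptions for the demo. The analysis NOMN actually runs is not described here.

```python
import math


def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))


def null_test_residual_db(reference, processed):
    """Level of (processed - reference) relative to the reference, in dB."""
    residual = [p - r for r, p in zip(reference, processed)]
    residual_rms = rms(residual)
    if residual_rms == 0.0:
        return float("-inf")  # perfect null: the signals are identical
    return 20.0 * math.log10(residual_rms / rms(reference))


SAMPLE_RATE = 48_000
TONE_HZ = 1_000.0
SHIFT_S = 5e-6      # assumed stand-in for a purely temporal change
N_SAMPLES = 4_800   # exactly 100 cycles of the tone

reference = [math.sin(2 * math.pi * TONE_HZ * n / SAMPLE_RATE) for n in range(N_SAMPLES)]
processed = [math.sin(2 * math.pi * TONE_HZ * (n / SAMPLE_RATE - SHIFT_S))
             for n in range(N_SAMPLES)]

identical_db = null_test_residual_db(reference, reference)
shifted_db = null_test_residual_db(reference, processed)
print(f"identical signals: {identical_db} dB")
print(f"5 us shift:        {shifted_db:.1f} dB")
```

Identical signals null completely (minus infinity dB), while the purely temporal 5μs shift of a 1 kHz tone leaves a residual about 30 dB below the signal.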

--
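## Formats and Access

API: RESTful HTTP endpoint. Send audio, receive processed audio. Optional control parameters. Automatic mode available.

Licensing: Available for integration into hardware, software, and streaming infrastructure. Per-device, per-track, or enterprise licensing models.

Patent status: Patent pending (Japan, 2026). POLYTOPE KK.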

--
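## On Subtlety

The effect is subtle by design. It is not a discrete change you can hear the way you hear EQ. It is a qualitative shift in how audio feels as a temporal experience. Audio has always worked by exploiting the ear's temporal resolution: a clock fast enough to outrun perceptual discrimination produces the illusion of continuity. NOMN operates at that same threshold, not by degrading the clock but by giving it a structured instability: the kind that acoustic and mechanical systems have always had and that digital systems have eliminated.

Whether this matters for a particular listener, a particular recording, or a particular playback chain is an empirical question, not a rhetorical one. We make no claims about what you will feel. But we feel it, and we hope you will too.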